Abstract

Many data sets consist of variables with an inherent group structure. The 
problem of group selection has been well studied, but in this paper, we seek 
to do the opposite: our goal is to select at least one variable from each group 
in the context of predictive regression modeling. This problem is NP-hard,
but we propose the tightest convex relaxation: a composite penalty that is a
combination of the $\ell_1$ and $\ell_2$ norms. Our so-called Exclusive Lasso method
performs structured variable selection by ensuring that at least one variable is 
selected from each group. We study our method’s statistical properties and 
develop computationally scalable algorithms for fitting the Exclusive Lasso. 

We study the effectiveness of our method via simulations as well as using NMR 
spectroscopy data. Here, we use the Exclusive Lasso to select the appropriate 
chemical shift from a dictionary of possible chemical shifts for each molecule in 
the biological sample. 

Keywords: Structured Variable Selection, Composite Penalty, NMR Spectroscopy, 
Exclusive Lasso 
1 Introduction


In regression problems with a predefined group structure, we seek to accurately
predict the response using a subset of variables composed of at least one variable from
each predefined group. We can phrase this structured variable selection problem as a
constrained optimization problem where we minimize a regression loss function
subject to a constraint that ensures sparsity and selects at least one variable from every
predefined group. This problem has potential applications in many areas including
genetics, chemistry, computer science, and proteomics. Consider a motivating
example from finance. In portfolio selection, the variance of the portfolio is just as
important as the expected performance of the returns (Markowitz, 1952). Suppose
we want to select an index fund comprised of a diverse set of 50 stocks whose
performance approximates the performance of the S&P 500. We can ensure that we are
selecting a diversified portfolio by requiring that we select at least one stock from
every financial sector; selecting securities from different sectors diversifies the index
fund and effectively lowers the variance of the return of our portfolio. We can phrase
this strategy as a structured variable selection problem where we minimize the
difference in performance between the S&P 500 and our portfolio subject to selecting a
small set of securities that is comprised of at least one security from each predefined
financial sector.

Even though this problem is known to be NP-hard, a popular approach in the
literature uses convex penalties to relax similar combinatorial problems into tractable
convex problems. While the Lasso (Tibshirani, 1996) is the most well known of these
convex relaxations, there are several frameworks specifically designed to find convex
alternatives to complicated structured combinatorial problems (Obozinski and Bach,
2012; Halabi and Cevher, 2014). These frameworks lead to convex penalties like the
Group Lasso (Yuan and Lin, 2006), Composite Absolute Penalties (Zhao et al., 2009),
and the Exclusive Lasso (Zhou et al., 2010), the subject of this paper. Zhou et al.
(2010) first used the Exclusive Lasso penalty in the context of multitask learning, and
Obozinski and Bach (2012) and Halabi and Cevher (2014) relate the penalty to their
frameworks for relaxing combinatorial problems. The Exclusive Lasso penalty has not
yet been explored statistically or developed into a method that can be used for sparse
regression and within group variable selection. We will develop the Exclusive Lasso
method and study its statistical properties in this paper.

To motivate our statistical investigation of the Exclusive Lasso for sparse regression
further, consider the problem of selecting one variable per group using existing
techniques such as the Lasso or Marginal Regression. If the Lasso's incoherence
condition and beta-min condition are satisfied and Marginal Regression's faithfulness
assumption is satisfied, then both methods recover the correct variables without any
knowledge of the group structure (Genovese et al., 2012; Wainwright, 2009). However,
data rarely satisfy these assumptions. Consider that if two variables are correlated
with each other, the Lasso often selects one instead of both variables. When whole
groups are correlated, the Lasso may only select variables in one group as opposed to
variables across multiple groups. Similarly, if the variables most correlated with the
response are in the same group, Marginal Regression will ignore the true variables in
other groups. If we recall the portfolio selection example, we group variables together
because they are correlated. In these situations, the fact that the Lasso and Marginal
Regression are agnostic to the group structure hurts their ability to select a reasonable
set of variables across all predefined groups. If we know that this group structure is
inherent to our problem, then complex, correlated real-world data motivate the
development of new structured variable selection methods that directly enforce the desired
selection across groups.

In this paper, we investigate the statistical properties of the Exclusive Lasso for
sparse, within group variable selection in regression problems. Specifically, our novel
contributions beyond the existing literature (Zhou et al., 2010; Obozinski and Bach,
2012; Halabi and Cevher, 2014) include: characterizing the Exclusive Lasso solution
and relating this solution to the existing statistics literature on penalized regression
(Section 2); proving consistency and prediction consistency (Section 3); developing a
fast algorithm with convergence guarantees for estimation (Section 4); deriving the
degrees of freedom that can be used for model selection (Section 5); and investigating
the empirical performance of our method through simulations (Sections 6 and 7).
2 The Exclusive Lasso


Consider the linear model where the response is a linear combination of the variables
subject to Gaussian noise: $y = X\beta^* + \epsilon$ where $\epsilon$ is i.i.d. Gaussian. For notational
convenience, we assume the response is centered to eliminate an intercept term. We
assume $\beta^*$ is structured such that its indices are divided into non-overlapping,
predefined groups and that the support of $\beta^*$ is distributed across all groups. We allow the
support set within a group to be as small as one element and as large as the entire
group. We can write this as two structural assumptions: (1) there exists a collection of
non-overlapping predefined groups, denoted $\mathcal{G}$, such that
$\bigcup_{g \in \mathcal{G}} g = \{1, \ldots, p\}$ and $\bigcap_{g \in \mathcal{G}} g = \emptyset$,
and (2) the support set $S$ of the true parameter $\beta^*$ is non-empty in each group,
such that for all $g \in \mathcal{G}$ we have $S \cap g \neq \emptyset$ and $\beta^*_i \neq 0$ for all $i \in S$. Let
$C = \{\beta \in \mathbb{R}^p : \beta_S \neq 0,\ S \cap g \neq \emptyset,\ \forall g \in \mathcal{G}\}$ be the set of all parameters that
satisfy our structural assumptions.

Our goal is to find the element in $C$ that best represents $y$ using the optimization
problem $\hat\beta = \arg\min_{\beta \in C} \|y - X\beta\|_2^2$. Our constraint set makes this a combinatorial
problem that is generally NP-hard. Instead of considering the problem as stated, we
study its convex relaxation by replacing the combinatorial constraint with the convex
penalty $P(\beta) = \frac{1}{2} \sum_{g \in \mathcal{G}} \|\beta_g\|_1^2$, first proposed in the context of document
classification and multitask learning (Zhou et al., 2010). Obozinski and Bach (2012) showed that the
Exclusive Lasso penalty is in fact the tightest convex relaxation for the combinatorial
constraint requiring the solution to contain exactly one variable from each group.

In this paper we propose to study the Exclusive Lasso penalty in the context of
penalized regression, looking at both the constrained version:

$$\hat\beta = \arg\min_{\beta} \frac{1}{2}\|y - X\beta\|_2^2 \quad \text{subject to } P(\beta) \le \tau \tag{1}$$

where $\tau$ is some positive constant, and its Lagrangian:

$$\hat\beta = \arg\min_{\beta} \frac{1}{2}\|y - X\beta\|_2^2 + \frac{\lambda}{2} \sum_{g \in \mathcal{G}} \|\beta_g\|_1^2. \tag{2}$$

We predominantly work with the Lagrangian form; because the problem is convex,
the two formulations are equivalent.
Now let us understand the penalty better. For each group $g$, the penalty takes the
1-norm of the parameter vector restricted to the group $g$, $\beta_g$, and then takes the squared
2-norm of the vector of group norms. If each element is its own group, the penalty is
equivalent to the ridge regression penalty. If all elements are in the same group, the penalty is
equivalent to squaring the 1-norm penalty. Loosely, the penalty performs selection within group
by applying separate lasso penalties to each group. At the group level, the penalty is
a ridge penalty, preventing entire groups from going to zero. Whatever the case, the
group structure informs the type of regularization because it is a composite penalty,
utilizing the $\ell_1$ and $\ell_2$ norms within and between groups respectively.
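The two special cases described above can be checked numerically. The following is a minimal sketch, assuming the penalty is $P(\beta) = \frac{1}{2}\sum_g \|\beta_g\|_1^2$ and that groups are encoded as one integer label per coefficient; the function name and the encoding are illustrative choices, not part of the method's definition.

```python
import numpy as np

def exclusive_lasso_penalty(beta, groups):
    """Evaluate P(beta) = 0.5 * sum_g ||beta_g||_1^2.

    `groups` holds one integer group label per coefficient (an assumed encoding).
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    total = 0.0
    for g in np.unique(groups):
        # l1 norm within the group, squared before summing across groups
        total += np.abs(beta[groups == g]).sum() ** 2
    return 0.5 * total

beta = np.array([1.0, -2.0, 3.0])
# Every coefficient in its own group: half the squared l2 norm (ridge).
p_singletons = exclusive_lasso_penalty(beta, [0, 1, 2])
# All coefficients in one group: half the squared l1 norm.
p_one_group = exclusive_lasso_penalty(beta, [0, 0, 0])
# Two coefficients in one group and a third in a group of its own.
p_mixed = exclusive_lasso_penalty(beta, [0, 0, 1])
```

With singleton groups the value equals `0.5 * np.sum(beta**2)`, and with a single group it equals `0.5 * np.sum(np.abs(beta))**2`, matching the ridge and squared 1-norm cases.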

As an illustration, consider the following toy example. Let $\beta^* = (\beta_{1,1}, \beta_{1,2}, \beta_{2,1})$
be our parameter such that the first index denotes group membership and the second
denotes the element within group. If we evaluate the penalty at this parameter
we have $2P(\beta^*) = (|\beta_{1,1}| + |\beta_{1,2}|)^2 + (\beta_{2,1})^2$. We can visualize this example using
the Exclusive Lasso's unit ball as shown in Figure 1. Restricting our attention to
variables in the same group, $\beta_{1,1}$ and $\beta_{1,2}$, and setting $\beta_{2,1} = 0$ yields a unit ball equivalent
to the ball generated by the $\ell_1$ norm. Alternatively, if we restrict our attention to
variables in different groups and set either $\beta_{1,1} = 0$ or $\beta_{1,2} = 0$, the unit ball is equivalent to
the ball generated by the $\ell_2$ norm. The geometry of simple convex penalties dictates
the structure of the estimate in constrained least squares problems (Chandrasekaran
et al., 2012), suggesting that because the $\ell_1$-norm enforces sparsity in its estimate and
the $\ell_2$-norm enforces density, we can expect the Exclusive Lasso to send either $\beta_{1,1}$
or $\beta_{1,2}$ to zero while never sending $\beta_{2,1}$ to zero.


Like the Group Lasso, studied by Yuan and Lin (2006), the Exclusive Lasso
assumes the variables have an inherent group structure. However, the Group Lasso also
assumes that only a small number of groups represent the response $y$. Consequently,
the Group Lasso penalty performs selection at the group level, sending entire groups
to zero. Despite their differences, both the Exclusive Lasso and the Group Lasso are
examples of a broader class of penalties studied by Zhao et al. (2009) called Composite
Absolute Penalties. Composite Absolute Penalties employ combinations of $\ell_p$
norms to effectively model a known grouped or hierarchical structure. The first norm
is applied to the coefficients in a group. This enforces the desired structure within
group. The second norm is applied at the group level to the vector of group norms.
This yields the desired structure between groups. In a sense, the Exclusive Lasso
is the opposite of the Group Lasso. Where the Exclusive Lasso employs an $\ell_1$-norm
within group and an $\ell_2$-norm between groups, the Group Lasso uses an $\ell_2$-norm within
group and an $\ell_1$-norm between groups. Several authors have investigated some of the
well known composite penalties. Nardi et al. (2008) study the conditions under which
the Group Lasso correctly identifies the support. Negahban and Wainwright
(2008) study the theoretical properties of the $\ell_1/\ell_\infty$ norm penalty, a penalty similar
to the Group Lasso. Despite the work on other composite penalties, the statistical
properties of the Exclusive Lasso have not yet been studied.

[Figure 1 panels: (a) The unit ball is equivalent to the $\ell_2$ ball between groups, enforcing
density. (b) The unit ball is equivalent to the $\ell_1$ ball within group, enforcing sparsity.
(c) The Exclusive Lasso unit ball.]

Figure 1: The unit ball for the Exclusive Lasso penalty. The ball has properties
of both the $\ell_1$ unit ball and the $\ell_2$ unit ball. Let $\beta^* = (\beta_{1,1}, \beta_{1,2}, \beta_{2,1})$ be a
parameter with two groups where the first index denotes the group and the second
index enumerates the elements within a group. Considering the perspective where
$\beta_{2,1} = 0$ yields a ball equivalent to the $\ell_1$ ball (b). Considering a perspective where
either $\beta_{1,1} = 0$ or $\beta_{1,2} = 0$ yields a ball equivalent to the $\ell_2$ ball (a).
2.1 Optimality Conditions


We use the first order optimality conditions to characterize the active set and derive
two expressions for the Exclusive Lasso estimate $\hat\beta$. Each of these expressions offers
insight into either the behavior of the estimate or its statistical properties.

Because problem (2) is convex, an optimal point satisfies $-X^T(y - X\hat\beta) + \lambda z = 0$,
where $z$ is an element of the subgradient such that

$$z_i \in \partial P(\hat\beta)_i =
\begin{cases}
\mathrm{sign}(\hat\beta_i)\,\|\hat\beta_g\|_1 & \text{if } \hat\beta_i \neq 0,\ i \in g \\
\left[-\|\hat\beta_g\|_1,\ \|\hat\beta_g\|_1\right] & \text{if } \hat\beta_i = 0,\ i \in g.
\end{cases} \tag{3}$$

Alternatively, we can express the subgradient as the product of a matrix and a
vector. If we let $M_g = \mathrm{sign}(\hat\beta_{S \cap g})\,\mathrm{sign}(\hat\beta_{S \cap g})^T$ and let $M_S$ be a block diagonal matrix
with matrices $M_g$ on the diagonal, then the subgradient restricted to the support set
$S$ of $\hat\beta$ will be $z_S = M_S \hat\beta_S$.

Note that the matrix $M_S$ depends on the support set, as the block diagonal matrices
are defined by the nonzero elements of $\hat\beta$ in each group.


Proposition 1. If $S$ is the support set of $\hat\beta$, we can express $\hat\beta$ in terms of the support
set:

$$\hat\beta_S = (X_S^T X_S + \lambda M_S)^{\dagger} X_S^T y \quad \text{and} \quad \hat\beta_{S^c} = 0. \tag{4}$$

The matrix $M_S$ distinguishes the Exclusive Lasso from similar estimates like Ridge
Regression. It is a block diagonal matrix that is only equivalent to the identity matrix
when there is exactly one nonzero variable in each group. At this point, the Exclusive
Lasso behaves like a Ridge Regression estimate on the nonzero indices that it has
selected.
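To make the role of the sign matrix concrete, here is a minimal sketch that builds it from a vector of nonzero coefficients and their group labels, then checks that multiplying by it reproduces the elementwise subgradient $\mathrm{sign}(\hat\beta_i)\|\hat\beta_g\|_1$. The function name and argument layout are illustrative assumptions.

```python
import numpy as np

def build_M_S(beta_S, groups_S):
    """Matrix with blocks sign(beta_g) sign(beta_g)^T for each group.

    beta_S: the nonzero coefficients; groups_S: their integer group labels.
    The result is block diagonal when the entries are ordered by group.
    """
    beta_S = np.asarray(beta_S, dtype=float)
    groups_S = np.asarray(groups_S)
    s = np.sign(beta_S)
    # Entry (i, j) is s_i * s_j when i and j share a group, and 0 otherwise.
    same_group = groups_S[:, None] == groups_S[None, :]
    return np.outer(s, s) * same_group

beta_S = np.array([1.5, -0.5, 2.0])
groups_S = np.array([0, 0, 1])
M_S = build_M_S(beta_S, groups_S)
z_S = M_S @ beta_S  # subgradient restricted to the support

# Elementwise form: sign(beta_i) * ||beta_g||_1, with group norms 2.0 and 2.0.
z_elementwise = np.sign(beta_S) * np.array([2.0, 2.0, 2.0])

# With exactly one nonzero coefficient per group, the matrix is the identity.
M_identity = build_M_S([-1.5, 2.0], [0, 1])
```

The last line illustrates the remark above: with one nonzero variable per group, each block is the scalar $\mathrm{sign}(\hat\beta_i)^2 = 1$.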

Note that this characterization describes the behavior of the nonzero variables
but it does not describe the behavior of the entire active set as we vary $\lambda$. To derive
a second characterization of $\hat\beta$, we note that the optimality conditions imply that
every nonzero variable in the same group has an equal absolute correlation with the residual,
$|X_i^T(y - X\hat\beta)|$. This allows us to determine when variables enter and exit the active
set. Recall that there is always at least one nonzero variable in each group. Another
variable only enters the active set once its correlation with the residual is equal to
the correlation shared by the other nonzero variables in the same group. We call
the set

$$\mathcal{E} = \left\{ i : \frac{|X_i^T(y - X\hat\beta)|}{\|\hat\beta_g\|_1} = \lambda \right\}$$

the "weighted equicorrelation set" because of its
resemblance to the equicorrelation set described in Efron et al. (2004).

We can use this set to derive an explicit formula for $\hat\beta$.


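Proposition 2. If $\mathcal{E}$ is the weighted equicorrelation set, $i$ is in group $g$, and $\gamma$ is a
vector such that $\gamma_i = \|\hat\beta_g\|_1 - |\hat\beta_i|$, then

$$\hat\beta_{\mathcal{E}} = (X_{\mathcal{E}}^T X_{\mathcal{E}} + \lambda I)^{-1}\left[X_{\mathcal{E}}^T y - \lambda \gamma s\right] \quad \text{and} \quad \hat\beta_{\mathcal{E}^c} = 0 \tag{5}$$

where $s \in \{-1, 1\}^{|\mathcal{E}|}$ is a vector of signs that satisfies the optimality conditions, the
product $\gamma s$ is taken elementwise, and $\mathcal{E}^c$ is the complement of the set $\mathcal{E}$.

The expression points to the general behavior of the penalty. For the non-zero
indices, the first term is a ridge regression estimate $(X_{\mathcal{E}}^T X_{\mathcal{E}} + \lambda I)^{-1} X_{\mathcal{E}}^T y$. The second
term $(X_{\mathcal{E}}^T X_{\mathcal{E}} + \lambda I)^{-1} \lambda \gamma s$ adaptively shrinks the variables to zero. In the case where
all groups have exactly one non-zero element, the Exclusive Lasso estimate is a
ridge regression estimate, ensuring that there is at least one non-zero element in each
group.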
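The form of (5) can be traced back to the optimality conditions in a few lines. The sketch below assumes that $s_i = \mathrm{sign}(\hat\beta_i)$ on nonzero coordinates and that the product of $\gamma$ and $s$ is elementwise; it is a reading aid rather than the proof given in the appendix.

```latex
\begin{align*}
X_i^T\bigl(y - X_{\mathcal{E}}\hat\beta_{\mathcal{E}}\bigr)
  &= \lambda\, s_i\, \|\hat\beta_g\|_1
  && \text{(stationarity for } i \in \mathcal{E},\ i \in g\text{)} \\
  &= \lambda\, s_i \bigl(\gamma_i + |\hat\beta_i|\bigr)
  && \text{(definition of } \gamma_i\text{)} \\
  &= \lambda\, \gamma_i s_i + \lambda\, \hat\beta_i
  && \text{(since } s_i |\hat\beta_i| = \hat\beta_i\text{)} .
\end{align*}
% Stacking these equations over all i in E and collecting beta on the left:
\[
\bigl(X_{\mathcal{E}}^T X_{\mathcal{E}} + \lambda I\bigr)\,\hat\beta_{\mathcal{E}}
  = X_{\mathcal{E}}^T y - \lambda\, \gamma s .
\]
```

When each group has a single nonzero element, $\gamma_i = 0$ for every $i$ in the set, and the right-hand side reduces to the ridge regression normal equations.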

This characterization also helps us see that our method is not guaranteed to 
estimate exactly one non-zero element in each group. Selecting exactly one element 
from each group depends on the response $y$ and the design matrix $X$. We believe
that the degree of correlation between the columns of the design matrix impacts the
probability of selecting more than one element per group. In comparison to other
methods, we recover the correct structure at much higher rates, but it is possible 
to construct examples that prevent the Exclusive Lasso from estimating the correct 
structure. See the appendix for more details. 

Before proceeding we use a small simulated example to compare the behavior
of the Lasso to the behavior of the Exclusive Lasso. We let $y = X\beta^* + \epsilon$ where
$\epsilon \sim N(0, 1)$. The design matrix $X \in \mathbb{R}^{20 \times 30}$ is multivariate normal with covariance
that encourages correlation between groups and within groups. The incoherence
condition is not satisfied, with $|||X_{S^c}^T X_S (X_S^T X_S)^{-1}|||_\infty = 2.603$. There are five groups
and $\beta^*$ is nonzero for one variable in each group. In Figure 2, we show the Exclusive
Lasso and Lasso regularization paths for this example. In the figure the solid lines are
the truly nonzero variables and each color represents a different group. The Exclusive
Lasso sends variables to zero until there is exactly one nonzero variable in each group
whereas the Lasso eventually sends all variables to zero. Further, notice that the
Lasso does not enforce the proper structure. The first five variables to enter the
regularization path only represent three of the five groups. Because of this, the Lasso
misses several true variables. The regularization path also highlights the Exclusive
Lasso's connection to Ridge Regression; five variables will never go to zero.

[Figure 2: two panels, (a) and (b), each showing coefficient paths against Log(lambda) value.]

Figure 2: A toy simulation with n = 20 and p = 30 consisting of five groups with
one true variable per group. The coefficient paths of the true variables are solid
and non-true variables are dashed lines. Each color represents a different group.
(a) Regularization path for the Exclusive Lasso. The Exclusive Lasso behaves like
an adaptively regularized Ridge Regression estimate, sending variables to zero until
only one variable from each group is nonzero. At this point it behaves like a Ridge
Regression estimate. (b) Regularization path for the Lasso. The Lasso sends variables
to zero without considering the group structure. Note that the first five variables to
enter the model for the Lasso represent only groups 3, 4 and 5, whereas the Exclusive
Lasso has five variables, at least one from each group, that are in the model for all $\lambda$.
3 Statistical Theory


The Exclusive Lasso is prediction consistent under weak assumptions. These
assumptions are relatively easy to satisfy in practice compared to the assumptions typically
associated with sparsistency results or consistency in the $\ell_2$-norm. Throughout the
rest of this section, we use the following notation: as before, $X \in \mathbb{R}^{n \times p}$ denotes the
design matrix and $\beta^* \in \mathbb{R}^p$ is the true parameter. We let $\mathcal{G}$ be a collection of
non-overlapping groups such that $\bigcup_{g \in \mathcal{G}} g = \{1, 2, \ldots, p\}$ and, for all distinct
$g, h \in \mathcal{G}$, $g \cap h = \emptyset$. We let $S$
denote the support set of $\beta^*$, meaning that for all $i \in S$, $\beta^*_i \neq 0$. We denote elements
of $X$ as $X_{ij}$ and we index the columns of $X$ by group so that $X_g$ are the columns
corresponding to group $g$. Let $Y^* = X\beta^*$ and $\hat{Y} = X\hat\beta$ where the vector $\hat\beta$ is the estimate
produced by minimizing squared error loss subject to $P(\beta) \le K$ for some constant
$K$. The population mean squared prediction error is $\mathrm{MSPE}(\hat\beta) = \mathbb{E}(Y^* - \hat{Y})^2$ and the
estimated mean squared prediction error is
$\widehat{\mathrm{MSPE}}(\hat\beta) = \frac{1}{n}\sum_{i=1}^{n} (Y_i^* - \hat{Y}_i)^2$. Note that we
can also rewrite them so that $\mathrm{MSPE}(\hat\beta) = \mathbb{E}\|\hat\beta - \beta^*\|_{\Sigma}^2$ and
$\widehat{\mathrm{MSPE}}(\hat\beta) = \|\hat\beta - \beta^*\|_{\hat\Sigma}^2$,
where $\Sigma$ is the covariance matrix of $X$ and $\hat\Sigma$ is its sample counterpart. Later this allows
us to compare and bound the $\ell_2$-norm coefficient error by the mean squared prediction error.

In order to prove prediction consistency we need three assumptions:

Assumption (1): The data $X$ is generated by a probability distribution such that
the columns $\{X_1, \ldots, X_p\}$ have covariance $\Sigma$ and the entries of $X$ are bounded so that
$|X_{ij}| \le M$.

Assumption (2): The value of the penalty evaluated at the true parameter is
bounded so that $P(\beta^*) \le K$.

Assumption (3): The response is generated by the linear model $Y = X\beta^* + \epsilon$ where
$\epsilon \sim N(0, \sigma^2)$.

Using assumptions (1)-(3) we show that the Exclusive Lasso is prediction
consistent.

Theorem 1. Under assumptions (1), (2) and (3), the population mean squared
prediction error of $\hat\beta$ is bounded such that

$$\mathrm{MSPE}(\hat\beta) \le 2(K + |\mathcal{G}|) M \sigma \sqrt{\frac{2\log(2p)}{n}}
  + 8(K + |\mathcal{G}|)^2 M^2 \sqrt{\frac{2\log(2p^2)}{n}} \tag{6}$$

which goes to 0 as $n \to \infty$.


Our assumptions are similar to those of the Lasso. Authors have shown that
prediction consistency for the Lasso has assumptions that are much easier to satisfy
than assumptions for other consistency results like sparsistency (Greenshtein et al.,
2004). As with the Lasso's prediction consistency assumptions, many data sets will
satisfy assumption (1). If we believe the data truly arise from a linear model, then
assumptions (2) and (3) will be satisfied as well.

Theorem 1 shows that the Exclusive Lasso is consistent in terms of the norm $\|x\|_{\Sigma}$.
The result differs from the prediction consistency result in Chatterjee (2013) by one
term. The group structure in the penalty appears in the bound as the cardinality
of the collection of groups. This suggests that we can allow n, p and the number of
groups to scale together and still ensure that the estimate is prediction consistent.
We use this result to justify using the Exclusive Lasso for prediction when a small
number of variables are desired in each group.

We can also bound the estimated mean squared prediction error. 

Theorem 2. Under assumptions (1), (2) and (3), the estimated mean squared
prediction error of $\hat\beta$ is bounded such that

$$\mathbb{E}\left[\widehat{\mathrm{MSPE}}(\hat\beta)\right] \le 2(K + |\mathcal{G}|) M \sigma \sqrt{\frac{2\log(2p)}{n}} \tag{7}$$

which goes to 0 as $n \to \infty$.

Similar to Theorem 1, the Exclusive Lasso is consistent in terms of the norm $\|x\|_{\hat\Sigma}$
under weak assumptions. If we add a further assumption, we can show that the
Exclusive Lasso is consistent using the $\ell_2$ norm.
Corollary 1. If the smallest eigenvalue of the covariance matrix $\Sigma$ is bounded below
by $c > 0$, then the Exclusive Lasso estimate is consistent in the $\ell_2$-norm:

$$\|\hat\beta - \beta^*\|_2^2 \le \frac{2}{c}(K + |\mathcal{G}|) M \sigma \sqrt{\frac{2\log(2p)}{n}}
  + \frac{8}{c}(K + |\mathcal{G}|)^2 M^2 \sqrt{\frac{2\log(2p^2)}{n}}. \tag{8}$$

We add another assumption to establish consistency in the $\ell_2$ norm. This requires
the covariance matrix to be strictly positive definite, which is much more restrictive
than our previous assumptions on $\Sigma$. In general, our results for the Exclusive Lasso
are comparable to the consistency results for the Lasso but differ to account for the
additional structure in the penalty.
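The link between the eigenvalue assumption and the $\ell_2$-norm bound is the standard Rayleigh quotient inequality. As a brief sketch, writing $v = \hat\beta - \beta^*$ and using only that the smallest eigenvalue of the covariance matrix is at least $c > 0$:

```latex
\[
\|v\|_{\Sigma}^2 \;=\; v^T \Sigma\, v
  \;\ge\; \lambda_{\min}(\Sigma)\, \|v\|_2^2
  \;\ge\; c\, \|v\|_2^2
\qquad\Longrightarrow\qquad
\|v\|_2^2 \;\le\; \frac{1}{c}\, \|v\|_{\Sigma}^2 ,
\qquad v = \hat\beta - \beta^* .
\]
```

Any bound on the prediction error in the covariance-weighted norm therefore carries over to the coefficient error at the cost of a factor $1/c$, which is why the assumption $c > 0$ cannot be dropped.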


4 Estimation 


Many types of algorithms exist to fit sparse penalized regression models including
coordinate descent, proximal gradient descent, and Alternating Direction Method of
Multipliers (ADMM). We develop our Exclusive Lasso Algorithm based on proximal
gradient descent because it is well studied and known to be computationally efficient.
Roughly, this type of algorithm, popularized by Beck and Teboulle (2009), proceeds
by moving in the negative gradient direction of the smooth loss projected onto the
set defined by the non-smooth penalty. These algorithms are easy to implement
for simple penalties, because simple penalties typically have closed form proximal
operators.

In our case, the proximal operator associated with the Exclusive Lasso penalty is 
a major challenge as there is no analytical solution. The proximal operator for the 
Exclusive Lasso is defined as follows: 


$$\mathrm{prox}_P(z) = \arg\min_{\beta} \frac{1}{2}\|\beta - z\|_2^2 + \frac{\lambda}{2} \sum_{g} \|\beta_g\|_1^2. \tag{9}$$

We propose an iterative algorithm to compute the proximal operator of the Exclusive 
Lasso penalty, prove that this algorithm converges, and prove that the proximal 
gradient descent algorithm based on this iterative approach converges to the global 
solution of the Exclusive Lasso problem. 

First, we propose an algorithm to compute the proximal operator. 
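Lemma 1. For proximal operator $\mathrm{prox}_P(z)$ where $P$ is our Exclusive Lasso penalty,
if $S(z, \lambda) = \mathrm{sign}(z)(|z| - \lambda)_+$ and
$\beta_g^{-i} = (\beta_1^{k+1}, \ldots, \beta_{i-1}^{k+1}, \beta_{i+1}^{k}, \ldots, \beta_{p_g}^{k})$
collects the current values of the other coordinates in the group $g$ containing $i$, then the
coordinate-wise updates are:

$$\beta_i^{k+1} = S\left(\frac{1}{1+\lambda}\, z_i,\ \frac{\lambda}{1+\lambda}\, \|\beta_g^{-i}\|_1\right). \tag{10}$$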


Notice that each coordinate update depends on the other coordinates in the same 
group. Because of this, we can implement this in parallel over the groups. At each 
step, instead of cyclically updating all of the coordinates we update each group in 
parallel by cyclically updating each coordinate in a group. If there are a large number 
of groups or the data is very large, this can help speed up the calculation of the 
proximal operator. This is important in the context of our proximal gradient descent 
algorithm because the proximal operator is calculated at each step of the proximal 
gradient descent method. Empirically, we have observed that coordinate descent is 
an efficient way to calculate the proximal operator. However, we still need to prove 
that our algorithm converges to the correct solution. 
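The coordinate-wise scheme described above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions: the update is taken to be $\beta_i \leftarrow S\bigl(z_i/(1+\lambda),\ \lambda\|\beta_g^{-i}\|_1/(1+\lambda)\bigr)$, groups are processed one after another rather than in parallel, and the function names, tolerance, and iteration cap are hypothetical choices.

```python
import numpy as np

def soft_threshold(x, t):
    """S(x, t) = sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def prox_exclusive_lasso(z, groups, lam, tol=1e-10, max_iter=1000):
    """Approximate argmin_b 0.5*||b - z||_2^2 + (lam/2) * sum_g ||b_g||_1^2
    by cyclic coordinate descent within each group."""
    z = np.asarray(z, dtype=float)
    groups = np.asarray(groups)
    beta = np.zeros_like(z)
    for g in np.unique(groups):           # groups do not interact in the prox
        idx = np.flatnonzero(groups == g)
        b = np.zeros(len(idx))
        for _ in range(max_iter):
            b_old = b.copy()
            for j in range(len(idx)):
                # l1 norm of the other coordinates in the same group
                others = np.abs(b).sum() - abs(b[j])
                b[j] = soft_threshold(z[idx[j]] / (1 + lam),
                                      lam / (1 + lam) * others)
            if np.linalg.norm(b - b_old) <= tol:
                break
        beta[idx] = b
    return beta
```

Two small checks help build confidence in the update. With singleton groups the result is `z / (1 + lam)`, the ridge-type shrinkage. With `z = [3, 1]` in a single group and `lam = 1`, solving the stationarity conditions by hand gives `[1.5, 0]`: the smaller coordinate is thresholded to zero while the group keeps one nonzero entry.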

Note that because our penalty is non-separable in $\beta$, we cannot invoke standard
convergence guarantees for coordinate descent schemes without additional
investigation. Nevertheless, we can guarantee our algorithm converges and defer the proof to
the appendix:


Theorem 3. The coordinate descent algorithm converges to the global minimum of
the proximal operator optimization problem given in equation (9).

We are now ready to derive a proximal gradient descent algorithm to estimate
the Exclusive Lasso using the coordinate descent algorithm described above. As the
gradient of our $\ell_2$ regression loss is $-X^T(y - X\beta)$, our proximal gradient
descent update is $\beta^{k+1} = \mathrm{prox}_P\bigl(\beta^k - \frac{1}{L}(X^T X \beta^k - X^T y)\bigr)$, where $L = \lambda_{\max}(X^T X)$
is the Lipschitz constant for $\nabla \frac{1}{2}\|y - X\beta\|_2^2$ (see appendix). Note that this step and
Lipschitz constant are the same for all regression problems that use an $\ell_2$-norm loss
function. Putting everything together, we give an outline of our Exclusive
Lasso estimation algorithm in Algorithm 1.
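The outer loop just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the proximal operator is passed in as a callable `prox(z, t)` (a hypothetical interface returning the minimizer of $\frac{1}{2}\|\beta - z\|_2^2 + tP(\beta)$), and the stopping rule, tolerance, and names are illustrative. Because the step size is $1/L$, the penalty weight handed to the proximal operator is $\lambda/L$.

```python
import numpy as np

def proximal_gradient(X, y, prox, lam, tol=1e-8, max_iter=5000):
    """Minimize 0.5*||y - X b||_2^2 + lam*P(b) by proximal gradient descent.

    `prox(z, t)` must return argmin_b 0.5*||b - z||_2^2 + t*P(b).
    """
    L = np.linalg.eigvalsh(X.T @ X).max()    # lambda_max(X^T X)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (X @ beta) - X.T @ y     # gradient of the squared-error loss
        # With step size 1/L the penalty weight inside the prox becomes lam / L.
        beta_new = prox(beta - grad / L, lam / L)
        if np.linalg.norm(beta_new - beta) <= tol:
            return beta_new
        beta = beta_new
    return beta

# Stand-in prox for the special case where every variable is its own group:
# P(b) = 0.5*||b||_2^2, whose prox has the closed form z / (1 + t).
def singleton_prox(z, t):
    return z / (1.0 + t)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
beta_hat = proximal_gradient(X, y, singleton_prox, lam=1.0)
ridge = np.linalg.solve(X.T @ X + 1.0 * np.eye(4), X.T @ y)
```

In the singleton-group case the iterates should approach the ridge solution, which gives a simple sanity check. For general groups, a coordinate-descent routine for the Exclusive Lasso proximal operator would be supplied as `prox` instead of the closed form.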

Next, we prove convergence of Algorithm 1. Note that we never calculate the
proximal operator exactly. Our coordinate descent algorithm solves the proximal
operator optimization problem to within an arbitrarily small error. We need to ensure
that the proximal gradient descent algorithm converges despite this sequence of errors
$\{\epsilon_k\}$. We can show that as long as the sequence of errors converges to zero, the
proximal gradient descent algorithm will converge.

Algorithm 1: Exclusive Lasso Algorithm to fit the Exclusive Lasso

    Input: $\beta^{(0)} \in \mathbb{R}^p$, $\epsilon \in \mathbb{R}$, $\delta \in \mathbb{R}$
    Output: $\hat\beta \in \mathbb{R}^p$
    while $\|\beta^{k+1} - \beta^{k}\| > \epsilon$ do
        $z = \beta^{k} - \frac{1}{L}(X^T X \beta^{k} - X^T y)$
        In parallel for each $g$:
            Initialize $\beta_g \in \mathbb{R}^{p_g}$
            while $\|\beta_g^{j+1} - \beta_g^{j}\| > \delta$ do
                for $i = 1$ to $p_g$ do
                    update $\beta_i$ using the coordinate-wise rule (10)
        $\beta^{k+1} = \beta$
    return $\hat\beta$

Theorem 4. Given objective function $f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + \lambda P(\beta)$, the sequence
of iterates $\{\beta^k\}$ generated by our proximal gradient descent algorithm converges in
objective function at a rate of at least $O(1/k)$ when the sequences $\{\|e_k\|\}$ and
$\{\sqrt{\epsilon_k}\}$ are summable.


Overall, this particular algorithm compares well to ISTA, the proximal gradient
descent algorithm for the Lasso (Beck and Teboulle, 2009). Although computing the
proximal operator is more complicated due to the structure of the penalty, the
convergence rate is the same order as the convergence rate for ISTA. The fact that the
iterates are easy to compute and the convergence results are competitive reinforces
our empirical observations; despite the additional structure, the Exclusive Lasso
Algorithm compares well to first order methods for the Lasso and other penalized
regression problems.
5 Model Selection


In practice, we need a data-driven method to select $\lambda$ and regulate the amount of
sparsity within group. To this end, we provide an estimate of the degrees of freedom
that will allow us to use BIC and EBIC approaches for model selection. Note that
while other general model selection procedures like cross validation and stability
selection can be employed, these do not perform well for the Exclusive Lasso. As with the
Lasso, cross validation tends to overselect variables. Similarly, we observe that stability
selection overselects variables, possibly because the Exclusive Lasso always selects at
least one variable per group. If a true variable is not in the model, it is necessarily
replaced by a false variable, leading to artificially high probabilities of inclusion and
stability scores for false variables.

The BIC formula relies on an unbiased estimate for the degrees of freedom for
the Exclusive Lasso. We leverage techniques used by Stein (1981) and Tibshirani
et al. (2012) to calculate the degrees of freedom, but defer the proof to the appendix.
Our formula leads to an unbiased estimate for the degrees of freedom that we use for
both the BIC and the EBIC. Recall that the matrix $M_S$ is a block diagonal matrix
where each nonzero block $M_g$ is the outer product of the sign vector of the estimate,
$M_g = \mathrm{sign}(\hat\beta_{S \cap g})\,\mathrm{sign}(\hat\beta_{S \cap g})^T$. This leads to our statement of the degrees of freedom
for $\hat{y}$:

Theorem 5. For any design matrix $X$ and regularization parameter $\lambda > 0$, if $y$ is
normally distributed, then the degrees of freedom for $X\hat\beta$ is
$\mathrm{df}(\hat{y}) = \mathbb{E}\left[\mathrm{trace}\left(X_S (X_S^T X_S + \lambda M_S)^{\dagger} X_S^T\right)\right]$.

An unbiased estimate of the degrees of freedom is then

$$\widehat{\mathrm{df}}(\hat{y}) = \mathrm{trace}\left[X_S (X_S^T X_S + \lambda M_S)^{\dagger} X_S^T\right]. \tag{11}$$
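The estimate in (11) is straightforward to compute once a fitted coefficient vector is available. The sketch below reads the superscript in (11) as a Moore-Penrose pseudo-inverse, which coincides with the ordinary inverse whenever the matrix is invertible; the function name and the threshold `tol` used to decide which coefficients count as nonzero are illustrative assumptions.

```python
import numpy as np

def df_estimate(X, beta_hat, groups, lam, tol=1e-10):
    """trace[X_S (X_S^T X_S + lam * M_S)^+ X_S^T] for the support S of beta_hat.

    `tol` is an assumed numerical threshold for calling a coefficient nonzero.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    groups = np.asarray(groups)
    S = np.flatnonzero(np.abs(beta_hat) > tol)
    if S.size == 0:
        return 0.0
    X_S = X[:, S]
    s = np.sign(beta_hat[S])
    # M_S: s_i * s_j for pairs in the same group, zero across groups.
    same_group = groups[S][:, None] == groups[S][None, :]
    M_S = np.outer(s, s) * same_group
    H = X_S @ np.linalg.pinv(X_S.T @ X_S + lam * M_S) @ X_S.T
    return float(np.trace(H))
```

When the estimate has exactly one nonzero coefficient per group, `M_S` is the identity and the value reduces to the familiar ridge regression degrees of freedom, $\sum_j d_j^2 / (d_j^2 + \lambda)$, where $d_j$ are the singular values of $X_S$.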

To verify this result, we compare our unbiased estimate of the degrees of freedom
to simulated degrees of freedom following the setup outlined in Efron et al. (2004) and
Zou et al. (2007). Recall that for Gaussian $y$, the formula for the degrees of freedom
can be stated as $\mathrm{df}(\hat{y}) = \sum_{i=1}^{n} \mathrm{cov}(\hat{y}_i, y_i)/\sigma^2$. This formula points to a convenient way to
simulate the degrees of freedom. We let $\beta^*$ be the true parameter and we simulate $y$
$B$ times such that $y^b = X\beta^* + \epsilon^b$ where $\epsilon^b \sim N(0, 1)$. We then calculate an estimate
for the covariance. Because $y$ is standard Gaussian with $\sigma^2 = 1$, the simulated degrees
of freedom is $\widehat{\mathrm{df}}(\hat{y}) = \sum_{i=1}^{n} \widehat{\mathrm{cov}}_i$, where we simulate the covariances according
to $\widehat{\mathrm{cov}}_i = \frac{1}{B} \sum_{b=1}^{B} (\hat{y}_i^b - [H^b X\beta^*]_i)(y_i^b - [X\beta^*]_i)$. Note that $H^b$ is the hat matrix for the
estimate $\hat{y}^b$. In other words, $\mathbb{E}[\hat{y}^b] = X_S (X_S^T X_S + \lambda M_S)^{\dagger} X_S^T X\beta^* = H^b X\beta^*$ (where $S$
here depends on the estimate at iteration $b$). In our simulations, we set $B = 2000$
and found that empirically, our unbiased estimate of the degrees of freedom closely
matches the simulated degrees of freedom (Figure 3).



[Figure 3: plot with axis label "Simulated Degrees of Freedom".]

Figure 3: Comparison of our estimate for the degrees of freedom to the simulated
degrees of freedom. The simulated degrees of freedom match the estimated degrees
of freedom very closely.


We can now use our unbiased estimate of the degrees of freedom to develop a 
model selection method for the Exclusive Lasso based on the Bayesian Information 


Criteria (BIC) (Schwarz et al., 1978) and the Extended Bayesian Information Criteria 


(EBIC) (Chen and Chen, 2008). Recall that while the BIC provides a convenient and 
principled method for variable selection, it can be too liberal in a high dimensional 


setting and is known to select too many spurious variables. Chen and Chen (2008) 
address this with the EBIC approach. Hence, we present both the BIC and EBIC 
for our method, noting that the latter is preferable in high-dimensional settings. If 
we assume the variance of y is unknown, the respective formulas for the BIC and the 


16 


























EBIC are 


and 


BIG 



+ df(^) 


log(^) 

n 


( 12 ) 


EBIC = log {+ df( 9 )!^ + dt(y)!^ (13) 

\ n J n n 

These formulas for the BIC and the EBIC can be used to select $\lambda$ for the Exclusive
Lasso in practice. Usually, we can select $\lambda$ sufficiently large to select exactly one
variable per group. In cases where the design matrix does not permit selecting one
variable per group (as discussed in Sections 6 and 7), we suggest using the BIC or
EBIC to select $\lambda$ and then thresholding the estimate within each group so that there
is only one variable per group. We call this group-wise thresholding.
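As a rough illustration of how these criteria could be evaluated along a grid of regularization parameters, the sketch below assumes the unknown-variance forms BIC = log(RSS/n) + df log(n)/n and EBIC = BIC + df log(p)/n. The function names and the idea of passing precomputed fitted values and degrees of freedom are assumptions made for the example.

```python
import numpy as np


def bic(y, y_hat, df):
    """BIC with unknown variance: log(RSS / n) + df * log(n) / n."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.shape[0]
    rss = float(np.sum((y - y_hat) ** 2))
    return np.log(rss / n) + df * np.log(n) / n


def ebic(y, y_hat, df, p):
    """EBIC: the BIC plus an extra df * log(p) / n term for high dimensions."""
    n = np.asarray(y).shape[0]
    return bic(y, y_hat, df) + df * np.log(p) / n


def select_by_criterion(y, fits, dfs, p=None):
    """Index of the fit with the smallest BIC (or EBIC when p is given)."""
    if p is None:
        scores = [bic(y, f, d) for f, d in zip(fits, dfs)]
    else:
        scores = [ebic(y, f, d, p) for f, d in zip(fits, dfs)]
    return int(np.argmin(scores))
```

Here `fits` would hold the fitted values for each candidate value of the regularization parameter and `dfs` the corresponding degrees-of-freedom estimates.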


6 Simulation Study 

We study the empirical performance of our Exclusive Lasso through two sets of simulation
studies: first, for selecting one variable per group and second, for selecting a
small number of variables per group. We examine three situations with moderate to
large amounts of correlation between groups and within groups. We omit the low correlation
setting from the simulations because it corresponds to design matrices that
are nearly orthogonal, satisfying both the Incoherence condition and the Faithfulness
condition. This is not representative of the types of real data for which we would
need to use the Exclusive Lasso and is uninteresting because all methods perform
perfectly, selecting all of the truly nonzero variables and none of the false variables.

In the first simulations, we simulate data using the model $y = X\beta^* + \epsilon$ where
$\epsilon \sim N(0, 1)$ and $\beta^*$ is the true parameter. The variables are divided into five equal-sized
groups and the true parameter is nonzero at one index in each group and zero
otherwise. We use three design matrices, each with n = 100 observations and p = 100
variables, to test the robustness of the Exclusive Lasso to within group correlation and
between group correlation. All three matrices are drawn from a multivariate normal
distribution with a Toeplitz covariance matrix with entries $\Sigma_{ij} = w^{|i-j|}$ for variables in
the same group, and $\Sigma_{ij} = b^{|i-j|}$ for variables in different groups. The first covariance
matrix uses constants b = .9 and w = .9 to simulate high correlation within groups
and high correlation between groups. The second covariance matrix uses b = .6 and
w = .9 so that the correlation between groups is lower than the correlation within
groups, resulting in high correlation within groups and medium correlation between
groups. The third covariance matrix uses constants w = .6 and b = .6 so that there
is medium correlation both between groups and within groups.
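The design described above could be generated along the following lines. This is only a sketch under stated assumptions: the covariance entries are taken to be w^|i-j| within a group and b^|i-j| between groups, groups are contiguous and equal-sized, and negative eigenvalues (which can occur when w and b differ) are clipped to zero before sampling. The function names are hypothetical.

```python
import numpy as np


def grouped_toeplitz_cov(p, n_groups, w, b):
    """Covariance with entries w**|i-j| within a group and b**|i-j| between groups.

    Assumes p is divisible by n_groups and that groups are contiguous blocks.
    """
    idx = np.arange(p)
    lag = np.abs(idx[:, None] - idx[None, :])
    group = idx // (p // n_groups)
    same_group = group[:, None] == group[None, :]
    return np.where(same_group, float(w) ** lag, float(b) ** lag)


def simulate_design(n, sigma, seed=0):
    """Draw n rows from N(0, sigma) using an eigendecomposition of sigma."""
    rng = np.random.default_rng(seed)
    vals, vecs = np.linalg.eigh(sigma)
    vals = np.clip(vals, 0.0, None)      # guard: clip any negative eigenvalues
    root = vecs * np.sqrt(vals)          # root @ root.T equals the clipped sigma
    return rng.standard_normal((n, sigma.shape[0])) @ root.T
```

For example, `simulate_design(100, grouped_toeplitz_cov(100, 5, w=0.9, b=0.6))` would give a 100 by 100 design resembling the second setting.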

We compare two versions of our Exclusive Lasso as described in the previous
section. First, we use a regularization parameter $\lambda$ large enough to ensure that the
method selects exactly one element per group. In these simulations, $\lambda = \max_i |X_i^T y|$
was large enough to ensure the correct structure was estimated; we refer to this as
the Exclusive Lasso. The second estimate, the Thresholded Exclusive Lasso, chooses
the regularization parameter $\lambda$ that minimizes the BIC and then thresholds in each
group, keeping the index with the largest magnitude. We also compare our method
to competitors and logical extensions of competitors in the literature. We base three
comparison methods on the Lasso: first, we take the largest regularization parameter
that yields exactly five nonzero coefficients (Lasso); second, we take the largest $\lambda$ that
has nonzero indices in each group and then threshold group-wise to keep the coefficient
in each group with the largest magnitude (Thresholded Lasso); third, we take the first
coefficient along the Lasso regularization path to enter the active set from each group
(Thresholded Regularization Path). Our final two comparison methods use Marginal
Regression: first, we take the five indices that maximize $|X_i^T y|$ (Marginal Regression);
second, we take the one coefficient in each group that maximizes $|X_i^T y|$ for $i \in g$
(Group-wise Marginal Regression). For all methods we select a set of variables $S$, and
then use the data matrix restricted to this set, $X_S$, to calculate an Ordinary Least
Squares estimate $\hat{\beta}_S$. The prediction error is calculated using $\hat{\beta}_S$. Results in terms of
prediction error and variable selection recovery are given in Table 1.
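The group-wise thresholding and least-squares refitting steps shared by several of these methods can be sketched as below. The helper names are hypothetical, and the sketch assumes a group label is available for every column of the design matrix.

```python
import numpy as np


def groupwise_threshold(beta, groups):
    """Keep, in each group, the index whose coefficient has the largest magnitude.

    If every coefficient in a group is zero, the first index of that group is kept.
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    selected = []
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        selected.append(int(members[np.argmax(np.abs(beta[members]))]))
    return np.array(sorted(selected))


def groupwise_marginal(X, y, groups):
    """Group-wise Marginal Regression: pick argmax_i |X_i^T y| within each group."""
    return groupwise_threshold(X.T @ y, groups)


def ols_refit(X, y, support):
    """Ordinary least squares restricted to the selected columns; zeros elsewhere."""
    beta = np.zeros(X.shape[1])
    coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    beta[support] = coef
    return beta
```

Refitting by least squares on the selected set removes the shrinkage of the penalized estimate, so the methods are compared on the variables they select rather than on how strongly they shrink.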

The thresholded version of the Exclusive Lasso outperforms all other methods at
all levels of correlation, likely because it selects more variables that are truly nonzero.
We observe that the thresholded estimators generally perform better than the
non-thresholded estimators. Among non-thresholded estimators, the Exclusive Lasso also

               Exclusive     Lasso         Marginal      Group-wise    Thresholded   Thresholded   Thresholded
               Lasso                       Regression    Marginal      Exclusive     Lasso         Regularization
                                                         Regression    Lasso                       Path
w=.9, b=.9
  True Vars    2.180 (1.02)  2.160 (0.82)  1.340 (0.63)  1.500 (0.84)  3.760 (0.96)  1.760 (1.06)  2.080 (0.97)
  False Vars   2.820 (1.02)  2.840 (0.82)  3.660 (0.63)  3.500 (0.84)  1.240 (0.96)  3.240 (1.06)  2.920 (0.97)
  Pred Err     1.351 (0.15)  1.433 (0.13)  1.608 (0.14)  1.411 (0.12)  1.115 (0.13)  1.411 (0.17)  1.325 (0.15)
w=.9, b=.6
  True Vars    3.86 (0.88)   3.700 (0.81)  2.10 (0.74)   4.020 (0.82)  4.480 (0.68)  4.060 (1.10)  3.96 (0.90)
  False Vars   1.14 (0.88)   1.300 (0.81)  2.90 (0.74)   0.980 (0.82)  0.520 (0.68)  0.940 (1.10)  1.04 (0.90)
  Pred Err     1.11 (0.10)   1.236 (0.17)  1.55 (0.16)   1.102 (0.11)  1.064 (0.09)  1.129 (0.15)  1.10 (0.11)
w=.6, b=.6
  True Vars    4.720 (0.50)  4.600 (0.53)  3.620 (0.53)  4.200 (0.49)  4.940 (0.24)  4.720 (0.45)  4.740 (0.44)
  False Vars   0.280 (0.50)  0.400 (0.53)  1.380 (0.53)  0.800 (0.49)  0.060 (0.24)  0.280 (0.45)  0.260 (0.44)
  Pred Err     1.066 (0.15)  1.094 (0.15)  1.304 (0.15)  1.162 (0.15)  1.022 (0.10)  1.062 (0.13)  1.057 (0.13)

Table 1: We compare the Exclusive Lasso and a thresholded version of the Exclusive 
Lasso to alternative variable selection methods as described in the Simulation section. 
Here, there is one nonzero coefficient in each of the five groups, n = 100 and p = 100,
and we vary the amount of between (b) and within (w) group correlation of the design 
matrix with Toeplitz covariance. The Thresholded Exclusive Lasso outperforms all of 
the competing methods in both the recovery of truly nonzero variables and prediction 
error. 


performs the best at all levels of correlation. These simulations highlight the Exclusive 
Lasso’s robustness to moderate and large amounts of correlation, which is important 
considering we expect variables in the same group to be similar and possibly highly 
correlated with each other. 

In the second set of simulations, we also simulate data using the model $y = X\beta^* + \epsilon$
where $\epsilon \sim N(0, 1)$ and $\beta^*$ is the true parameter for n = p = 100. In these simulations
the variables are divided into the same five equal-sized groups but the true parameter
can be nonzero at more than one index in each group. Specifically, there are seven
nonzero coefficients distributed so that three groups have exactly one nonzero index
and two groups have two nonzero indices each. We simulate the design matrices in the
same way as in the first set of simulations, so that they have varying
levels of between and within group correlation.

We compare three methods: the Exclusive Lasso, the Lasso, and the Lasso applied
independently to each group. For all methods, we use the BIC to select the
regularization parameter. When we apply the Lasso separately to each group we use
separate regularization parameters as well.

Results in terms of prediction error and variable selection are given in Table 2.




               Exclusive Lasso   Lasso           Group-wise Lasso
w=.9, b=.9
  True Vars    6.820 (0.48)      6.940 (0.24)    4.920 (0.88)
  False Vars   6.280 (2.41)      9.380 (3.08)    5.880 (1.86)
  Pred Err     1.262 (0.22)      1.295 (0.22)    1.967 (0.64)
w=.9, b=.6
  True Vars    6.740 (0.69)      6.940 (0.24)    4.780 (1.02)
  False Vars   6.420 (2.64)      9.360 (3.35)    6.180 (2.03)
  Pred Err     1.232 (0.22)      1.259 (0.23)    1.944 (0.54)
w=.6, b=.6
  True Vars    7.000 (0.00)      7.000 (0.00)    6.720 (0.45)
  False Vars   3.940 (2.61)      5.320 (3.80)    2.080 (1.28)
  Pred Err     1.197 (0.19)      1.233 (0.21)    1.265 (0.29)


Table 2: We compare the Exclusive Lasso to the Lasso and the Group-wise Lasso with
BIC model selection for the second simulation scenario, where we have five groups with
either one or two true variables per group for a total of seven true variables. Again,
n = 100 and p = 100, and the amount of between and within group correlation of
the design matrix is varied. The Exclusive Lasso performs best in terms of variable
selection and prediction error.

The Exclusive Lasso has the best prediction error across all three simulations.
The Exclusive Lasso selects fewer false variables than the Lasso and selects more true
variables than the Group-wise Lasso. These simulations also suggest the Exclusive
Lasso is more robust to high levels of correlation. Overall, our results suggest that the
Exclusive Lasso performs best at within group variable selection when we have known
group structure with relatively large amounts of correlation within and/or between
groups.


7 NMR Spectroscopy Study 


Finally, we illustrate an application of the Exclusive Lasso for selecting the chemical
shift of molecules in Nuclear Magnetic Resonance (NMR) spectroscopy. NMR
spectroscopy is a high-throughput technology used to study the complete metabolic
profile of a biological sample by measuring a molecule's interaction with an external
magnetic field (De Graaf, 2013; Cavanagh et al., 1995). This technology produces a
spectrum where the chemical components of each molecule resonate at a particular
ppm. See Figure 4.b for an example. A central analysis goal of NMR spectroscopy is
identifying and quantifying the molecules in a given biological sample. This is challenging
for numerous reasons discussed in Ebbels et al. (2011), Weljie et al. (2006), and
Zhang et al. (2009). We seek to use the Exclusive Lasso to solve one of the major
analysis challenges with NMR spectroscopy: accounting for positional uncertainty
when quantifying relative concentrations of known molecules in a sample. Every
molecule's chemical signature is subject to a random translation in ppm, known as a
"chemical shift" (Figure 4.a), due to the external physical environment of the sample
(De Graaf, 2013). One way to model this positional uncertainty is to create an
expanded dictionary of shifted molecules to use for quantification. With this expanded
dictionary, we can consider each molecule and its shifts as a group, and use
the Exclusive Lasso to select the best shift of each molecule for quantification.

We choose not to use real NMR spectroscopy data, as the true molecules and
true concentrations are often unknown. Instead, we create a simulation based on real NMR
molecule spectra in order to test our method for the purpose of NMR quantification.
In our application, we simulate an NMR signal using a dictionary of reference
measurements for thirty-three unique molecules. The dictionary, $X \in \mathbb{R}^{n \times p}$,
consists of spectra for thirty-three molecules and ten artificial positional shifts for
each molecule, five left and five right. These shifts are no more than .05 ppm greater
Figure 4: (a) Positional uncertainty of the chemical shift for the molecule Carnosine.
All NMR spectroscopy signals are subject to random translations in ppm due to the
chemical environment of the sample. (b) NMR spectrum of a neuron cell sample. NMR
spectroscopy measures concentrations of all molecules in a sample. The observed
signal is a linear combination of the chemical signatures of its unobserved component
molecules.


than or less than the reference measurement yielding eleven possible positions for 
each molecule. We use one randomly selected shift for each molecule, hence simu¬ 
lating the positional uncertainty found in real data. The columns of this expanded 
dictionary are strongly correlated with each other. Molecules are correlated with their 
ten shifts as well as other molecules with similar chemical structures. If we consider 
each molecule and its shifts a group, this results in a data set that has high correlation 
between groups as well as high correlation within each group as seen in Figure 5.a. 

The simulated NMR signal, $y$, is a linear combination of the molecules in the
dictionary with values chosen so that the signal has several properties that we observe
in real data. For example, real NMR data can contain several unique molecules. Many
of these will resonate at similar frequencies, causing peaks to overlap (De Graaf,
2013). Informally, this yields signals that appear smoother with less pronounced
peaks because of the crowding. With thirty-three molecules we can recreate this
effect in the region between .5 and 0 ppm (see Figure 5.b). We then simulate our
signal using positive noise so that $y = X\beta^* + \epsilon$ where $\epsilon$ is the absolute value of
Gaussian noise; this is done because real NMR spectra are non-negative.
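A rough sketch of this simulation setup is given below. It is an illustration under several assumptions, not the authors' procedure: shifts are measured in grid points rather than ppm, shifted spectra are produced with a circular `np.roll`, and the concentration range and noise level are arbitrary placeholder values. The function names are hypothetical.

```python
import numpy as np


def expand_dictionary(reference, max_shift=5):
    """Build an expanded dictionary with 2 * max_shift + 1 positions per molecule.

    reference: (n_points, n_molecules) array of reference spectra.
    Returns the expanded matrix and a group label (molecule index) per column.
    """
    columns, groups = [], []
    for m in range(reference.shape[1]):
        for shift in range(-max_shift, max_shift + 1):
            # Circular shift in grid points; real shifts would be in ppm.
            columns.append(np.roll(reference[:, m], shift))
            groups.append(m)
    return np.column_stack(columns), np.array(groups)


def simulate_signal(X, groups, noise_sd=0.01, seed=0):
    """Pick one random shift per molecule and add non-negative noise."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        # Placeholder concentration range, chosen only for illustration.
        beta[rng.choice(members)] = rng.uniform(0.5, 2.0)
    noise = np.abs(rng.normal(0.0, noise_sd, size=X.shape[0]))
    return X @ beta + noise, beta
```

With thirty-three reference spectra and `max_shift=5`, the expanded matrix has eleven columns per molecule, and each molecule's columns form one group.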

We then use each method, the Exclusive Lasso, the Lasso, and the Group-wise 

                      Mean Squared Error ($\beta$)   Prediction Error
Exclusive Lasso       1.072 (.03)                  1.339e-04 (9.797e-07)
OLS regression        2.871 (.06)                  2.605e-04 (1.162e-06)
Marginal Regression   1.163 (.23)                  1.452e-04 (1.841e-05)
Lasso                 2.092 (.14)                  8.025e-05 (1.091e-05)


Table 3: In our simulation using NMR spectroscopy data, we seek to quantify concentrations
of molecules in a sample (see MSE($\beta$)) under positional uncertainty in
chemical shifts. Here, OLS regression quantifies concentrations without accounting
for positional uncertainty, whereas the Exclusive Lasso, Marginal Regression and
the Lasso account for positional uncertainty by selecting one chemical shift for each
molecule from an expanded dictionary. Given the selected variables, $S$, these methods
use OLS estimates for $X_S$ to estimate $\beta$ and quantify concentrations, the accuracy of
which is measured by $\mathrm{MSE}(\beta) = \frac{1}{p}\|\beta^* - \hat{\beta}\|_2^2$.

Lasso, to select a set of variables $S$, consisting of one shift from each molecule's group
of chemical shifts. Where applicable we use the thresholded versions of the estimates,
where we select $\lambda$ using the BIC and threshold group-wise so that there is only one
nonzero variable in each group. Finally, we compare these methods to an ordinary
least squares estimate that uses the original un-expanded dictionary without modeling
the positional shifts. In Table 3, we report the prediction error and mean squared
error, $\mathrm{MSE} = \frac{1}{p}\sum_{i=1}^{p} \left(\beta_i^* - [(X_S^T X_S)^{-1} X_S^T y]_i\right)^2$, so that we can accurately compare the
methods as variable selection procedures. This measure eliminates the shrinkage that
occurs with penalized regression methods and allows us to focus on how accurately
we recover the concentrations of each molecule.

Among all methods, the Exclusive Lasso performs best at quantifying molecule
concentrations under positional uncertainty. This case study highlights a real example
where there is high correlation both within and between pre-defined groups.
Consistent with our simulation studies, the Exclusive Lasso performs best in these
situations.

Figure 5: (a) The covariance matrix for the expanded dictionary of molecules. We
simulate chemical shifts by generating 10 lagged variables for each of the 33 molecules
(blocks on the diagonal). A molecule and its 10 shifts comprise a group where each
variable in the group is very correlated with every other member in the group. We can
also see that the molecules are very correlated with each other as we include molecules
that are chemically similar. (b) The simulated NMR signal and the signal estimated
using the Exclusive Lasso. The estimate recovers most of the peaks, suggesting it is
selecting a useful set of shifts. The estimate also zeros out most of the noise in the
simulated signal.
8 Discussion


Although others have introduced the Exclusive Lasso penalty, we are the first to
investigate the method's statistical properties in the context of sparse regression for
within group variable selection. We propose two new characterizations of the Exclusive
Lasso in an effort to understand the estimate. The first characterization is
an explicit definition of $\hat{\beta}$ in terms of the support set that allows us to derive the
degrees of freedom. This expression is similar to that of the ridge regression estimate,
especially when there is exactly one nonzero variable in each group. The second
characterization allows us to explore the properties of the active set. We then prove that
the Exclusive Lasso is prediction consistent under weak assumptions, the first such
result. Additionally, we develop a new algorithm for fitting the Exclusive Lasso based
on proximal gradient descent and derive the degrees of freedom so that we can use
the BIC formula or the EBIC formula for model selection.

Overall, we find that the Exclusive Lasso compares favorably to existing methods.
Even though the Exclusive Lasso is a more complex composite penalty, convergence
results for the Exclusive Lasso Algorithm are comparable to convergence rates for
standard first order methods for computing the Lasso. Additionally, through several
simulations, we find that the Exclusive Lasso not only selects at least one variable
per group better than any existing method, but also performs better when there is
strong correlation both within groups and between groups.

In this work, we focus on statistical questions important to the practitioner, but
there are several directions for future work. Investigating variable selection consistency,
overlapping or hierarchical group structures, and inference are important open
questions. One could also use the Exclusive Lasso penalty with other loss functions
such as that of generalized linear models. Additionally, there are many possible
applications of our method besides NMR spectroscopy, such as creating index funds in
finance and selecting genes from functional groups or pathways, among others.

Overall, the Exclusive Lasso is an effective method for within group variable
selection in sparse regression; an R package will be made available for others to utilize
our method.
9 Appendix


Proof of theorems 1 and 2 


The proof of theorems 1 and 2 follows the proof technique presented in Chatterjee
(2013). There are several differences due to the structure of our penalty; however,
the assumptions are the same. We assume that the columns of the design matrix
$\{X_1, \ldots, X_p\}$ are possibly dependent random variables such that the covariance matrix
for $\{X_1, \ldots, X_p\}$ is $\Sigma$. We assume the entries of $X$ are bounded so that $|X_{ij}| \le M$
and that the data we observe, $(Y_1, X_1), \ldots, (Y_n, X_n)$, are independent and identically
distributed. We also assume the value of the penalty evaluated at the true parameter
is bounded so that $P(\beta^*) \le K$ and that the response is generated by the linear model
$Y = X\beta^* + \epsilon$ where $\epsilon \sim N(0, \sigma^2)$. Let $\mathcal{G}$ be a collection of predefined non-overlapping
groups such that $\bigcup_{g \in \mathcal{G}} g = \{1, \ldots, p\}$.

Instead of the Exclusive Lasso penalty, we work with the equivalent constrained
optimization problem

$$\hat{\beta} = \operatorname*{argmin}_{\beta : P(\beta) \le K} \|y - X\beta\|_2^2.$$

Let $C = \{X\beta : P(\beta) \le K\}$. By definition, $\hat{Y}$ is the projection of $Y$ onto the set $C$.
For constrained optimization problems, first order necessary conditions for an optimal
solution state that a solution $x^*$ to the problem necessarily satisfies $f'(x^*; d) \ge 0$
for all $d$ in the linear tangent cone. In our case the linear tangent cone is the set
$T_C(\hat{Y}) = \{(x - \hat{Y}) : x \in C\}$, so an optimal solution satisfies $\langle -(Y - \hat{Y}), (x - \hat{Y}) \rangle \ge 0$ for
all $x \in C$. Letting $x = Y^*$ we can rewrite $\langle (Y - \hat{Y}), (Y^* - \hat{Y}) \rangle \le 0$ as the inequality


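$$\|Y^* - \hat{Y}\|_2^2 \le \langle (Y - Y^*), (\hat{Y} - Y^*) \rangle = \sum_{i=1}^{n} \epsilon_i \left( \sum_{j=1}^{p} (\hat{\beta}_j - \beta_j^*) X_{ij} \right) = \sum_{j=1}^{p} (\hat{\beta}_j - \beta_j^*) \left( \sum_{i=1}^{n} \epsilon_i X_{ij} \right).$$

Our assumption $P(\beta^*) \le K$ and the definition of $\hat{\beta}$ let us bound $\sum_{j=1}^{p} |\hat{\beta}_j - \beta_j^*|$ so
that

$$\sum_{j=1}^{p} |\hat{\beta}_j - \beta_j^*| \le 2(K + |\mathcal{G}|).$$

This implies that if we let $U_j = \sum_{i=1}^{n} \epsilon_i X_{ij}$ then

$$\|Y^* - \hat{Y}\|_2^2 \le 2(K + |\mathcal{G}|) \max_{1 \le j \le p} |U_j|.$$

Because $U_j \sim N\left(0, \sigma^2 \sum_{i=1}^{n} X_{ij}^2\right)$ we have the bound

$$E\left(\max_{1 \le j \le p} |U_j|\right) \le M\sigma\sqrt{2n\log(2p)}.$$

See lemma 3 in Chatterjee (2013) for proof of the bound. Therefore

$$E\|Y^* - \hat{Y}\|_2^2 \le 2(K + |\mathcal{G}|) M\sigma\sqrt{2n\log(2p)},$$

which gives us theorem 2:

$$E[\mathrm{MSPE}(\hat{\beta})] \le 2(K + |\mathcal{G}|) M\sigma \sqrt{\frac{2\log(2p)}{n}}.$$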


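We use this result to prove theorem 1. By the independence of the data $(Y, X)$
and $\hat{\beta}$ we have

$$E(Y^* - \hat{Y})^2 = \sum_{j,k=1}^{p} (\beta_j^* - \hat{\beta}_j)(\beta_k^* - \hat{\beta}_k) E(X_j X_k).$$

Note that

$$\frac{1}{n}\|Y^* - \hat{Y}\|_2^2 = \frac{1}{n}\sum_{j,k=1}^{p} (\beta_j^* - \hat{\beta}_j)(\beta_k^* - \hat{\beta}_k) \sum_{i=1}^{n} X_{ij} X_{ik}.$$

Combining these two expressions yields

$$E(Y^* - \hat{Y})^2 - \frac{1}{n}\|Y^* - \hat{Y}\|_2^2 = \sum_{j,k=1}^{p} (\beta_j^* - \hat{\beta}_j)(\beta_k^* - \hat{\beta}_k)\left(E(X_j X_k) - \frac{1}{n}\sum_{i=1}^{n} X_{ij} X_{ik}\right).$$

We then define $V_{j,k} = E(X_j X_k) - \frac{1}{n}\sum_{i=1}^{n} X_{ij} X_{ik}$ and note that it is bounded, $|V_{j,k}| \le 2M^2$. By Hoeffding's inequality

$$E\left(\max_{1 \le j,k \le p} |V_{j,k}|\right) \le 2M^2 \sqrt{\frac{2\log(2p^2)}{n}}.$$

We use a version of Hoeffding's inequality that is rather uncommon, so we refer the
interested reader to the appendix of Chatterjee (2013) for a derivation of the result.
Finally

$$E(Y^* - \hat{Y})^2 - \frac{1}{n}\|Y^* - \hat{Y}\|_2^2 \le 4(K + |\mathcal{G}|)^2 \max_{1 \le j,k \le p} |V_{j,k}|.$$

Combining our results yields theorem 1:

$$E(Y^* - \hat{Y})^2 \le 2(K + |\mathcal{G}|) M\sigma \sqrt{\frac{2\log(2p)}{n}} + 8(K + |\mathcal{G}|)^2 M^2 \sqrt{\frac{2\log(2p^2)}{n}}.$$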

Proof of corollary 1 

The MSPE0) is equal to E|| f — /9*||e- We can bound || f —/3*||2 by the MSPE such 
that II f —/d *||2 < ^11 f ~/^*lll showing that || f —/d *||2 goes to 0 as MSPE0) goes to 

0 . 


Proof of theorem 3 

Our coordinate descent algorithm calculates the proximal operator by solving the
optimization problem

$$\operatorname{prox}_P(y) = \operatorname*{argmin}_{x} \frac{1}{2}\|y - x\|_2^2 + \lambda P(x).$$

We show that the assumptions for theorem 4.1 from Tseng (2001) hold for the
problem above. For a function of the form

$$f(x) = g(x) + h(x)$$

where $g$ is convex and differentiable and $h$ is convex but not necessarily differentiable,
verifying the assumptions involves showing that

1. The differentiable part of our function, $g$, satisfies assumption (A1) from Tseng
(2001).

Assumption (A1): The domain of $g$ is open and $g$ is Gâteaux differentiable.

2. The function $f$ is a regular function.

3. The level set $X^0 = \{x : f(x) \le f(x^0)\}$ is compact and $f$ is continuous on
$X^0$.

4. For every pair $i, k \in \{1, \ldots, p\}$, $f$ is jointly pseudoconvex in $x_i$ and
$x_k$.

First we state several definitions. 

We say direction $d$ is a vector in $\mathbb{R}^n$. We allow $d_k$ to be the scalar in the $k$th position
in the vector $(0, \ldots, 0, d_k, 0, \ldots, 0)$. We abuse notation if the meaning is unambiguous,
and also let $d_k$ denote the entire vector with 0s in all positions except for the $k$th
position. It is typical to define first order optimality conditions in terms of the
Gâteaux derivative. We however use the more general forward variation defined as
follows:

Definition 1. For a function $f$ the forward variation in direction $d$ at $x$ is

$$f'(x; d) = \lim_{t \downarrow 0} \frac{f(x + td) - f(x)}{t}.$$

The Gâteaux derivative exists if both the forward and backward variation exist and
are equal. Tseng uses the Gâteaux derivative to define his optimality conditions, but
for our unconstrained convex non-differentiable problem it is necessary and sufficient
for a minimizer of $f$ to satisfy $f'(x; d) \ge 0$ for all $d \in \mathbb{R}^n$. We also use a notion called
regularity. Note that this is the same definition of regularity given in Tseng (2001),
restated here for convenience. Throughout the rest of the paper we use the
forward variation and the directional derivative interchangeably.
Definition 2. A function $f$ is regular at $x$ if $f'(x; d) \ge 0$ for all $d$ such that $f'(x; d_k) \ge 0$
for every coordinate direction $d_k$.

Regularity ensures that if we have a point that minimizes $f$ coordinatewise, then
the point minimizes the function $f$.

Definition 3. A function $f$ is pseudoconvex if $f(x + d) \ge f(x)$ whenever $x \in \mathrm{dom}(f)$
and $f'(x; d) \ge 0$.

Assumption 1: The differentiable part of our function, $g$, satisfies assumption (A1)
from Tseng (2001).

Proof. If we let

$$g(x) = \frac{1}{2}\|y - x\|_2^2$$

its domain is $\mathbb{R}^n$, which is an open set. We must also show that $g(x) = \frac{1}{2}\|y - x\|_2^2$
is Gâteaux differentiable on $\mathbb{R}^n$:

$$g'(x; d) = \lim_{t \downarrow 0} \frac{g(x + td) - g(x)}{t} = -(y - x)^T d = \nabla g(x)^T d.$$

A similar argument holds as $t \uparrow 0$.

□


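Assumption 2: The function $f$ is a regular function.

Proof. Our goal is to show that if we have a point $x$ that minimizes $f$ coordinatewise, i.e.
that $f'(x; d_k) \ge 0$ for all $d_k$, then we have a point that minimizes $f$ and satisfies the
standard first order necessary and sufficient condition for optimality, $f'(x; d) \ge 0$ for
all $d$. We know that $g(x) = \frac{1}{2}\|y - x\|_2^2$ is Gâteaux differentiable on $\mathbb{R}^n$.

Next we show that the entire function $f(x) = g(x) + h(x)$ is regular. Assume that
the point $x$ minimizes $f$ coordinatewise, therefore satisfying

$$f'(x; (0, \ldots, 0, d_k, 0, \ldots, 0)) \ge 0$$

for all $d_k$. Then it follows that

$$\begin{aligned}
f'(x; d) &= \nabla g(x)^T d + \lim_{t \downarrow 0} \frac{\left(\sum_{i=1}^{n} |x_i + td_i|\right)^2 - \left(\sum_{i=1}^{n} |x_i|\right)^2}{t} \\
&= \nabla g(x)^T d + \lim_{t \downarrow 0} \frac{\left(\sum_{i=1}^{n} |x_i + td_i| - \sum_{i=1}^{n} |x_i|\right)\left(\sum_{i=1}^{n} |x_i + td_i| + \sum_{i=1}^{n} |x_i|\right)}{t} \\
&= \nabla g(x)^T d + \lim_{t \downarrow 0} \frac{\sum_{i=1}^{n} |x_i + td_i| - \sum_{i=1}^{n} |x_i|}{t} \, \lim_{t \downarrow 0} \left(\sum_{i=1}^{n} |x_i + td_i| + \sum_{i=1}^{n} |x_i|\right) \\
&= \nabla g(x)^T d + \lim_{t \downarrow 0} \frac{\sum_{i=1}^{n} |x_i + td_i| - \sum_{i=1}^{n} |x_i|}{t} \, 2\|x\|_1 \\
&= \sum_{i=1}^{n} f'(x; (0, \ldots, 0, d_i, 0, \ldots, 0)) \\
&\ge 0.
\end{aligned}$$

□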


Assumption 3: The level set $X^0 = \{x : f(x) \le f(x^0)\}$ is compact and $f$ is
continuous on $X^0$.

Proof. We show that the function is continuous by showing that the penalty is continuous
and that the differentiable part of the objective function is continuous. Let
$x, y \in X^0$; then for any $\epsilon > 0$ there exists a $\delta$ such that for

$$|x - y| < \delta$$

it follows that

$$|P(x) - P(y)| < \epsilon.$$

To find $\delta$ consider

$$|P(x) - P(y)| \le P(x - y) = \sum_{g \in \mathcal{G}} \left(\sum_{i \in g} |x_i - y_i|\right)^2 \le \sum_{g \in \mathcal{G}} \left(\sum_{i \in g} \delta_i\right)^2.$$

Note that the first inequality follows from the reverse triangle inequality. If $i \in g$ then
for any $\epsilon > 0$ we can define $\delta$ such that $\delta_i = \frac{\sqrt{\epsilon}}{n_g\sqrt{|\mathcal{G}|}}$, which shows that the penalty is
continuous on the set.

To show that the term $\|y - x\|_2^2$ is continuous consider two points $x, z \in X^0$ and
suppose

$$|x - z| < \delta.$$

Consider

$$\left| \|y - x\|_2 - \|y - z\|_2 \right| \le \|(y - x) - (y - z)\|_2 = \|z - x\|_2 = \left(\sum_{i} |x_i - z_i|^2\right)^{1/2}.$$

So for sufficiently small $\delta_i$ the term $\|y - x\|_2^2$ is continuous. Therefore $f$ is continuous
because the sum of continuous functions is a continuous function. Using theorem 1.6 of
Rockafellar and Wets (2009), continuity implies that the level sets are closed.

The level sets also must be bounded. For any level set

$$X^0 = \{x : \|y - x\|_2^2 + \lambda P(x) \le \|y - x_0\|_2^2 + \lambda P(x_0)\},$$

if we let $\|y - x_0\|_2^2 + \lambda P(x_0) = \alpha$ we can consider a vector of the form $x_\alpha =
(0, \ldots, 0, c, 0, \ldots, 0)$, with the single nonzero entry $c$ chosen so that our penalty evaluated
at this vector gives $\lambda P(x_\alpha) = |\alpha| + 1 > \alpha$. Since $\|y - x\| \ge 0$ for all $x \in \mathbb{R}^n$, the
objective function satisfies $f(x_\alpha) > \alpha$. This
implies that for all $x \in X^0$ there exists an $M \in \mathbb{R}$ such that $\max_i |x_i| < M$. Therefore
the level sets are bounded.

By the Heine-Borel theorem, since $X^0$ is a closed bounded subset of $\mathbb{R}^n$ it is compact.

□

Assumption 4: For every pair $i, k \in \{1, \ldots, p\}$, $f$ is jointly pseudoconvex
in $x_i$ and $x_k$.

Proof. For any pair of indices $i, k \in \{1, \ldots, p\}$ the function

$$\|y - x\|_2^2 + \lambda P(x)$$

is jointly convex in $x_i$ and $x_k$. Suppose indices $i$ and $k$ are in the same group. We
can rewrite the objective function as

$$\begin{aligned}
f_1(x_i, x_k) &= \|x\|_2^2 - 2y^T x + y^T y + \lambda \sum_{g \in \mathcal{G}} \left(\sum_{j \in g} |x_j|\right)^2 \\
&= x_i^2 + x_k^2 + x_i c_0 + x_k c_1 + (x_i + x_k)^2 + c_2
\end{aligned}$$

where $c_0, c_1, c_2$ are terms constant in $x_i$ and $x_k$, and $y_{i,k} = (y_i, y_k)$ and $x_{i,k} = (x_i, x_k)$
are the vectors restricted to indices $i, k$. Both the $\ell_2$ norm and the affine function
of $x_{i,k}$ are convex. The function $f_1(x_i, x_k)$ has a positive semidefinite Hessian so it is
also convex.

If $i, k$ are in different groups we rewrite the objective function as

$$f_2(x_i, x_k) = 2x_i^2 + 2x_k^2 + c_0 x_i + c_1 x_k + c_2.$$

Function $f_2$ also has a positive semidefinite Hessian so it is also convex.

Therefore the function $f$ is convex in every pair of indices, which implies that it is
pseudoconvex in every pair of indices.

□

Given that the objective function satisfies all of the assumptions for Tseng (2001)
Theorem 4.1, we can say that our coordinate descent algorithm converges to a stationary
point. Because our function is convex, the stationary point is a global minimum.
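To make the coordinate descent computation of the proximal operator concrete, here is a minimal sketch. It assumes the penalty takes the form $P(x) = \frac{1}{2}\sum_{g}\|x_g\|_1^2$ and that the proximal problem is $\min_x \frac{1}{2}\|y - x\|_2^2 + \lambda P(x)$; under those assumptions each coordinate update has the closed form $x_i = S(y_i, \lambda r_i)/(1 + \lambda)$, where $S$ is soft-thresholding and $r_i$ is the sum of absolute values of the other coordinates in the same group. The function names are hypothetical and this is not presented as the authors' implementation.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def prox_exclusive(y, groups, lam, n_sweeps=500, tol=1e-12):
    """Coordinate descent for min_x 0.5*||y - x||^2 + lam * 0.5 * sum_g ||x_g||_1^2."""
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    x = np.zeros_like(y)
    for _ in range(n_sweeps):
        x_old = x.copy()
        for i in range(y.shape[0]):
            in_group = groups == groups[i]
            # l1 norm of the other coordinates in the same group
            rest = np.sum(np.abs(x[in_group])) - abs(x[i])
            x[i] = soft_threshold(y[i], lam * rest) / (1.0 + lam)
        if np.max(np.abs(x - x_old)) < tol:
            break
    return x
```

In this sketch, a coordinate is set to zero whenever $|y_i|$ does not exceed $\lambda$ times the $\ell_1$ norm of the rest of its group, which is how the penalty induces sparsity within groups while keeping each group active.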
Proof of theorem 4


Our result depends on work by Schmidt et al. (2011). We seek the convergence
rate for our Exclusive Lasso algorithm. In our algorithm, at each step $k$ the
proximal operator is computed to within a small error such that the iterate
$x_k = \epsilon_k + \operatorname*{argmin}_{x} \|y - x\|_2^2 + \lambda P(x)$. As long as the sequence of errors is summable, the
algorithm will converge at a rate of at least $O(1/k)$ when the following assumptions
hold. For a function $f(x) = g(x) + h(x)$ we assume

1. The function $g$ is convex with a Lipschitz-continuous gradient.

2. The function $h$ is a lower semi-continuous proper convex function.

3. There exists a point $x^* \in \mathbb{R}^n$ that minimizes $f$.

4. The points $x_k$ are $\epsilon_k$-optimal solutions to the proximal operator optimization
problem at iteration $k$.


We must verify that these assumptions hold for the Exclusive Lasso.

Assumption 1: In our case $g(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$, with gradient $\nabla g(\beta) = -X^T(y - X\beta)$, so

$$\|\nabla g(\beta_1) - \nabla g(\beta_2)\|_2 = \|X^T X(\beta_1 - \beta_2)\|_2 \le \|X^T X\|_2 \|\beta_1 - \beta_2\|_2 = \lambda_{\max}(X^T X)\|\beta_1 - \beta_2\|_2,$$

which implies that $g$ has a Lipschitz-continuous gradient with Lipschitz constant $L = \lambda_{\max}(X^T X)$,
the largest eigenvalue of $X^T X$.
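The role of this constant can be illustrated numerically. The sketch below, with hypothetical function names, computes $L$ as the largest eigenvalue of $X^T X$ and takes one gradient step on the smooth part with step size $1/L$, which is the step size commonly used in proximal gradient methods; it assumes $g(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$.

```python
import numpy as np


def smooth_loss(X, y, beta):
    """g(beta) = 0.5 * ||y - X beta||_2^2."""
    r = y - X @ beta
    return 0.5 * float(r @ r)


def grad_g(X, y, beta):
    """Gradient of g: -X^T (y - X beta)."""
    return -X.T @ (y - X @ beta)


def lipschitz_constant(X):
    """Largest eigenvalue of X^T X (eigvalsh returns eigenvalues in ascending order)."""
    return float(np.linalg.eigvalsh(X.T @ X)[-1])


def gradient_step(X, y, beta, L):
    """One gradient step on g with step size 1 / L."""
    return beta - grad_g(X, y, beta) / L
```

With step size $1/L$ the smooth loss cannot increase after a gradient step, which is the property the convergence argument relies on.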


Assumption 2: Because $\|x\|_1$ is continuous for all $x \in \mathbb{R}^n$ and $b(z) = z^2$ is continuous
for all $z \in \mathbb{R}$, their composition $\|x\|_1^2$ is continuous at all points in $\mathbb{R}^n$. To
show that the penalty is convex we will consider the convexity of $f(x) = \|x\|_1^2$. For
$t \in [0, 1]$ and $x, z \in \mathbb{R}^n$

$$\|tx + (1 - t)z\|_1^2 \le \left(t\|x\|_1 + (1 - t)\|z\|_1\right)^2 \le t\|x\|_1^2 + (1 - t)\|z\|_1^2,$$

where the last step uses the convexity of $z \mapsto z^2$.
Therefore $f(x) = \|x\|_1^2$ is convex. The convexity of $P(\beta)$ follows from the fact that
the sum of convex functions is also convex.

The penalty is proper by definition since for all $x \in \mathbb{R}^n$ we have $P(x) < \infty$.


Assumption 3: Using theorem 1.9 from Rockafellar and Wets (2009) we show
existence of a solution. We need the level sets $X_\alpha = \{x : f(x) \le \alpha\}$ to be bounded
for all $\alpha \in \mathbb{R}$. Consider a vector of the form $\beta_\alpha = (0, \ldots, 0, c, 0, \ldots, 0)$, with the single
nonzero entry $c$ chosen so that our
penalty evaluated at this vector gives $\lambda P(\beta_\alpha) = |\alpha| + 1 > \alpha$. Since $\|y - X\beta\| \ge 0$ for
all $\beta \in \mathbb{R}^p$, the objective function satisfies $f(\beta_\alpha) > \alpha$. This implies that for all $x \in X_\alpha$ there
exists an $M \in \mathbb{R}$ such that $\max_i |x_i| < M$. Therefore the level sets are bounded.

We have already shown that both $g$ and $h$ are continuous so their sum must also
be continuous. Therefore, because the level sets of our function $f$ are bounded and
$f$ is continuous and proper, by theorem 1.9 there exists a minimum to our objective
function $f$.

Assumption 4: This assumption holds by theorem 3.

Therefore by proposition 1 from Schmidt et al. (2011) the Exclusive Lasso algorithm
converges at a rate of $O(1/k)$.


Proof of theorem 5 

For a continuous and almost differentiable function $g$, Stein's formula

$$\mathrm{df}(g) = E[(\nabla \cdot g)(y)]$$

defines the degrees of freedom for normal random variables in terms of the function
$(\nabla \cdot g)$. The function $(\nabla \cdot g)$, known as the divergence, is defined for $g : \mathbb{R}^n \to \mathbb{R}^n$ as

$$(\nabla \cdot g)(y) = \sum_{i=1}^{n} \frac{\partial g_i}{\partial y_i}(y).$$

To derive the degrees of freedom for the Exclusive Lasso problem we need to prove
that the estimate is a continuous and almost differentiable function of $y$. Tibshirani
provides a lemma stating that

Lemma 2. For a convex set $C \subseteq \mathbb{R}^n$ the projection map $P_C$ and the map $I - P_C$ are
continuous and almost differentiable.

For proof see Tibshirani et al. (2012).


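Lemma 3. The estimate satisfies $X\hat\beta = (I - P_C)y$ for the set

$$C = \{u \in \mathbb{R}^n : P^*(X^T u) \le \alpha\},$$

where

$$P^*(\beta) = \Big(\sum_{g \in \mathcal{G}} \|\beta_g\|_\infty^2\Big)^{1/2}$$

is the dual norm of the square root of our penalty and $\alpha$ is a constant.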


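Proof. The dual norm of a norm $\|z\|$ is defined as the norm $\|x\|_*$ such that
$\|z\| = \sup\{\langle x, z\rangle : \|x\|_* \le 1\}$. Note that for the square root of our penalty, at coordinates with $\beta_i \ne 0$,

$$\partial\sqrt{P(\beta)} = \frac{1}{2\sqrt{P(\beta)}}\Big(\mathrm{sign}(\beta_i)\,\|\beta_g\|_1\Big)_{i \in g,\ g \in \mathcal{G}}.$$

This means that our dual norm is the norm such that $P^*\big(\partial\sqrt{P(\beta)}\big) \le 1$, which
holds for the norm

$$P^*(\beta) = \Big(\sum_{g \in \mathcal{G}} \|\beta_g\|_\infty^2\Big)^{1/2}.$$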

We show that $\hat\theta = y - X\hat\beta$ is equal to the projection of $y$ onto the set $C$. The
projection $\hat\theta = P_C(y)$ can be characterized as a point $\hat\theta$ satisfying the first order
optimality conditions for the constrained optimization problem $\min_{\theta \in C} \|y - \theta\|_2^2$. The
first order optimality conditions are

$$f'(\hat\theta; d) \ge 0,$$
$$\langle y - \hat\theta,\ \hat\theta - u\rangle \ge 0 \quad \text{for all } u \in C.$$

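We must verify that $f'(\hat\theta; d) \ge 0$. If we let $\hat\theta = y - X\hat\beta(y)$ then

$$\begin{aligned}
\langle y - \hat\theta,\ \hat\theta - u\rangle &= \langle X\hat\beta,\ y - X\hat\beta - u\rangle && (14) \\
&= \langle X\hat\beta,\ y - X\hat\beta\rangle - \langle X^T u,\ \hat\beta\rangle && (15) \\
&= \alpha\sqrt{P(\hat\beta)} - \langle X^T u,\ \hat\beta\rangle && (16) \\
&= \max_{P^*(w) \le \alpha} \langle w,\ \hat\beta\rangle - \langle X^T u,\ \hat\beta\rangle && (17) \\
&\ge 0 && (18)
\end{aligned}$$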


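Line 3 follows from the fact that there exists a regularization parameter such that
the necessary conditions for the Exclusive Lasso problem are exactly the same as
the necessary conditions for the optimization problem that uses the square root of
the Exclusive Lasso penalty. Notice that if we let $\alpha = 2\lambda P(\hat\beta)^{1/2}$ then $\lambda\,\partial P(\hat\beta) =
\alpha\,\partial\sqrt{P(\hat\beta)}$. This implies that $\hat\beta$ necessarily satisfies

$$-X^T(y - X\hat\beta) + \alpha\,\partial\sqrt{P(\hat\beta)} = 0.$$

Taking the inner product with $\hat\beta$, and using that $\langle \hat\beta,\ \partial\sqrt{P(\hat\beta)}\rangle = \sqrt{P(\hat\beta)}$ because $\sqrt{P}$ is a norm, yields

$$\langle X\hat\beta,\ y - X\hat\beta\rangle = \alpha\sqrt{P(\hat\beta)}.$$

Line 5 follows for the set $C = \{u \in \mathbb{R}^n : P^*(X^T u) \le \alpha\}$, proving that $y - X\hat\beta$ is
equal to the projection of $y$ onto the set $C$. This implies that $X\hat\beta = (I - P_C)y$.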

□ 
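
The duality step used in the proof can be illustrated numerically. The sketch below is an illustration rather than part of the original argument. It assumes that the square root of the penalty is, up to a constant factor, the mixed norm $(\sum_g \|\beta_g\|_1^2)^{1/2}$, whose dual is the standard mixed norm $(\sum_g \|w_g\|_\infty^2)^{1/2}$. The function names and the group layout are hypothetical. It checks the Hölder-type bound $\langle w, \beta\rangle \le P^*(w)\,(\sum_g \|\beta_g\|_1^2)^{1/2}$ and constructs a vector $w$ with $P^*(w) = 1$ that attains it.

```python
import math


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def dot(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def norm_l1_l2(beta, groups):
    """l2 norm across groups of the within-group l1 norms."""
    return math.sqrt(sum(sum(abs(beta[i]) for i in g) ** 2 for g in groups))


def norm_linf_l2(w, groups):
    """l2 norm across groups of the within-group l-infinity norms (the dual)."""
    return math.sqrt(sum(max(abs(w[i]) for i in g) ** 2 for g in groups))


def aligned_vector(beta, groups):
    """Vector w with norm_linf_l2(w) = 1 and <w, beta> = norm_l1_l2(beta).

    Assumes beta is not the zero vector.
    """
    total = norm_l1_l2(beta, groups)
    w = [0.0] * len(beta)
    for g in groups:
        group_l1 = sum(abs(beta[i]) for i in g)
        for i in g:
            w[i] = sign(beta[i]) * group_l1 / total
    return w
```

The vector returned by `aligned_vector` has the same form as the subgradient of the square-root penalty, namely entries proportional to $\mathrm{sign}(\beta_i)\|\beta_g\|_1$, which is why the maximum in line (17) is attained.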

Combining Lemmas 2 and 3 yields that the Exclusive Lasso estimate is continuous
and almost differentiable. Next we define $\hat\beta$ in terms of the support set $S$. First recall
the KKT conditions

$$-X^T(y - X\hat\beta) + \lambda z = 0,$$

where

$$z_i \begin{cases} = \mathrm{sign}(\hat\beta_i)\,\|\hat\beta_g\|_1 & : \hat\beta_i \ne 0,\ i \in g \\ \in \big[-\|\hat\beta_g\|_1,\ \|\hat\beta_g\|_1\big] & : \hat\beta_i = 0,\ i \in g. \end{cases}$$

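Note that we can rewrite the subgradient for the indices $i \in g \cap S$. If we let
$s_{g \cap S} = \mathrm{sign}(\hat\beta_{g \cap S})$, then

$$z_{g \cap S} = s_{g \cap S}\, s_{g \cap S}^T\, \hat\beta_{g \cap S}.$$

We can write the subgradient over the indices of the support as

$$z_S = M_S \hat\beta_S,$$

where $M_S$ is a block diagonal matrix with the matrices $\{s_{g \cap S}\, s_{g \cap S}^T : g \in \mathcal{G}\}$ on the
diagonal.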
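
To make the block structure concrete, the following sketch builds $M_S$ for a small coefficient vector and confirms that $M_S \beta_S$ reproduces the entries $\mathrm{sign}(\beta_i)\|\beta_g\|_1$ on the support. It is an illustration only: the helper names (`build_M`, `subgradient_on_support`), the list-of-index-lists representation of the groups, and the example values are assumptions, not taken from the paper.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def build_M(beta, groups):
    """Return (support, M_S), with rows and columns of M_S ordered as `support`.

    Each group contributes one diagonal block s s^T, where s holds the signs of
    the nonzero coefficients in that group.
    """
    support = [i for g in groups for i in g if beta[i] != 0]
    pos = {i: k for k, i in enumerate(support)}
    m = len(support)
    M = [[0.0] * m for _ in range(m)]
    for g in groups:
        active = [i for i in g if beta[i] != 0]
        for i in active:
            for j in active:
                M[pos[i]][pos[j]] = float(sign(beta[i]) * sign(beta[j]))
    return support, M


def subgradient_on_support(beta, groups):
    """Return (support, z_S) with z_S = M_S beta_S."""
    support, M = build_M(beta, groups)
    b = [beta[i] for i in support]
    z = [sum(M[r][c] * b[c] for c in range(len(b))) for r in range(len(b))]
    return support, z


# Example with two groups; indices 2 and 4 are outside the support.
beta = [1.5, -0.5, 0.0, 2.0, 0.0]
groups = [[0, 1, 2], [3, 4]]
support, z = subgradient_on_support(beta, groups)
```

In this example the first group has $\|\beta_g\|_1 = 2$, so the entries of `z` for indices 0 and 1 are $2$ and $-2$, and the single active entry in the second group gives $2$.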

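We can rewrite the KKT conditions with respect to the support set

$$-X_S^T(y - X_S\hat\beta_S) + \lambda z_S = 0.$$

This is equal to

$$-X_S^T y + X_S^T X_S \hat\beta_S + \lambda z_S = 0.$$

We then solve for $\hat\beta_S$ using $z_S = M_S \hat\beta_S$, yielding

$$\hat\beta_S = (X_S^T X_S + \lambda M_S)^{\dagger} X_S^T y.$$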

Note that we are relying on the fact that we have already proved the existence of
a solution to the optimization problem in the proof for theorem 4. This gives us an
estimate $\hat y = X_S(X_S^T X_S + \lambda M_S)^{\dagger} X_S^T y$. The divergence is therefore

$$(\nabla \cdot X\hat\beta)(y) = \mathrm{trace}\big[X_S(X_S^T X_S + \lambda M_S)^{\dagger} X_S^T\big],$$

which is equal to the sum of the eigenvalues of this matrix.

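The trace expression for the degrees of freedom can be evaluated directly once the support is known. The sketch below is a minimal illustration under stated assumptions: the inputs `X_S`, `M_S`, and `lam` are hypothetical names for the support columns of the design matrix, the block diagonal sign matrix, and the regularization parameter, and the matrix $X_S^T X_S + \lambda M_S$ is assumed to be invertible so that an ordinary inverse can stand in for the pseudo-inverse. It uses the identity $\mathrm{trace}[X_S A^{-1} X_S^T] = \mathrm{trace}[A^{-1} X_S^T X_S]$.

```python
def gram(X):
    """Return X^T X for a matrix X given as a list of rows."""
    p = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]


def solve(A, B):
    """Solve A Z = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix; a pseudo-inverse would be needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def degrees_of_freedom(X_S, M_S, lam):
    """Return trace[X_S (X_S^T X_S + lam * M_S)^{-1} X_S^T]."""
    G = gram(X_S)
    k = len(G)
    A = [[G[i][j] + lam * M_S[i][j] for j in range(k)] for i in range(k)]
    Z = solve(A, G)  # Z = A^{-1} X_S^T X_S, which has the same trace
    return sum(Z[i][i] for i in range(k))


# Two selected variables with the same sign in one group, orthonormal design.
X_S = [[1.0, 0.0], [0.0, 1.0]]
M_S = [[1.0, 1.0], [1.0, 1.0]]
df_half = degrees_of_freedom(X_S, M_S, 0.5)
df_zero = degrees_of_freedom(X_S, M_S, 0.0)
```

With `lam = 0` the value equals the number of selected variables, as for least squares on the support. For positive `lam` the value is smaller, reflecting the additional shrinkage within the group.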

Penalty 

For specific values of $X$ and $y$, the Exclusive Lasso will select more than one variable
per group for all values of the regularization parameter $\lambda$. This means that although
the Exclusive Lasso is designed to select exactly one element per group, we cannot
guarantee that the Exclusive Lasso will enforce the correct structure. Consider an example.
Suppose we characterize the Exclusive Lasso estimate using the equicorrelation set.
Recall the equicorrelation set $\mathcal{E}$.

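Let $s$ be a vector such that $s_i = \mathrm{sign}(\hat\beta_i)$ for $i \in \mathcal{E}$, and let $\gamma$ be a vector such
that $\gamma_i = \|\hat\beta_{g_i}\|_1$, where $g_i$ is the group for an index $i \in \mathcal{E}$. Let $\tilde\gamma$ be a vector such
that $\tilde\gamma_i = \|\hat\beta_{g_i}\|_1 - |\hat\beta_i|$. Then we can solve for $\hat\beta$, where the product of two vectors is taken elementwise:

$$\begin{aligned}
X_{\mathcal{E}}^T(y - X_{\mathcal{E}}\hat\beta_{\mathcal{E}}) &= \lambda \gamma s \\
&= \lambda \tilde\gamma s + \lambda \hat\beta_{\mathcal{E}}, \\
\hat\beta_{\mathcal{E}} &= (X_{\mathcal{E}}^T X_{\mathcal{E}} + \lambda I)^{-1}\big[X_{\mathcal{E}}^T y - \lambda \tilde\gamma s\big].
\end{aligned}$$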

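Let $X = I_2$ and $y^T = (1, 1)$. Then, because $X$ is orthonormal, the estimate
simplifies to

$$\hat\beta = \frac{1}{1 + \lambda}\, y - \frac{\lambda}{1 + \lambda}\, \tilde\gamma s.$$

In this case $\hat\beta_1 = \hat\beta_2$, so the term $\frac{\lambda}{1 + \lambda}\tilde\gamma s$ is going to shrink both indices equally for
all $\lambda$. This prevents the estimate from selecting exactly one element in each group.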
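
The example can be checked against the KKT conditions stated earlier. The sketch below assumes that both variables belong to a single group and that both coefficients are positive, so that $s = (1, 1)$ and $\tilde\gamma_i$ is the absolute value of the other coefficient. Under these assumptions, setting $\hat\beta_1 = \hat\beta_2 = b$ in the KKT conditions gives $b = 1/(1 + 2\lambda)$. This closed form is derived here for illustration and is not quoted from the paper, and the function names are hypothetical.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def identity_pair_solution(lam):
    """Candidate solution for X = I_2, y = (1, 1), with one group {1, 2}.

    Assumed closed form: both coefficients equal 1 / (1 + 2 * lam).
    """
    b = 1.0 / (1.0 + 2.0 * lam)
    return [b, b]


def kkt_residual(beta, y, lam):
    """Largest absolute entry of -(y - beta) + lam * z when X is the identity.

    Uses z_i = sign(beta_i) * ||beta||_1, which is valid when every beta_i is
    nonzero and all coefficients belong to one group.
    """
    l1 = sum(abs(b) for b in beta)
    return max(abs(-(yi - bi) + lam * sign(bi) * l1) for yi, bi in zip(y, beta))


y = [1.0, 1.0]
lams = [0.01, 1.0, 100.0, 1e4]
residuals = [kkt_residual(identity_pair_solution(lam), y, lam) for lam in lams]
both_selected = [all(b > 0 for b in identity_pair_solution(lam)) for lam in lams]
```

Because the objective is strictly convex when $X = I_2$, a point satisfying the KKT conditions is the unique minimizer. Both coefficients therefore remain nonzero for every finite $\lambda$, which agrees with the discussion above.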

We conjecture that conditions on X and y for this to occur can be formalized, 
but this is beyond the scope of this work. Intuitively, this behavior occurs when two 
or more variables get shrunken equally. As such, this behavior is relatively rare in 
practice. If it does occur and one variable per group is desired, we propose to use 
BIC to select $\lambda$ and apply group-wise t
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